METHODS AND COMPOSITIONS FOR ELUCIDATING 
PROTEIN EXPRESSION PROFILES IN CELLS 



[0001] This application is a continuation-in-part of U.S. patent application No.
09/811,842, which claims benefit under 35 U.S.C. § 119(e) of provisional application No.
60/190,678 filed Mar. 20, 2000. This application also claims priority under 35 U.S.C. §
119(e) from provisional application 60/458,152 filed March 27, 2003. All of the
foregoing applications are incorporated herein by reference in their entirety.

FIELD OF THE INVENTION 

[0002] This invention relates generally to the field of functional genomics. The 

invention enables the direct correlation of genomic DNA to rapidly quantifiable protein 
expression levels enabling the detection of a protein expression profile for a particular 
cell or cell type. This information can then be used to correlate with reference cells to 
identify differences in protein expression patterns that are responsible for differentiation, 
disease states, age, or any other temporal or spatial protein expression difference in 
particular cells for diagnosis, pathway regulation or drug target candidates. 

BACKGROUND OF THE INVENTION 

[0003] The last quarter of a century has been marked by a relentless drive by 

molecular biologists to decipher first genes and then entire genomes. Genomics, the use 
of genetic and molecular biology techniques to develop complete genome maps, as well 
as underlying genomic sequences for different organisms, has provided an explosion of 
information about the underlying genes which make up all living things. The fruits of this 
work already include the genome sequences of 599 viruses and viroids, 205 naturally
occurring plasmids, 185 organelles, 31 eubacteria, 7 archaea, 1 fungus, 2 animals and 1
plant (Nature, 409:860-921 (2001) The Human Gene Consortium). A significant
milestone in the field of genomics culminated recently with the announcement that the 
entire human genome had been sequenced.

[0004] The most important application of this sequence data, however, will be the
ultimate identification of protein coding genes. Proteins are not produced directly from
DNA; instead, information in the form of DNA is transcribed to form messenger RNA
(mRNA) molecules. These mRNA molecules function as templates for protein synthesis
(translation). Each cell in the body contains the entire genome of the organism; however,
only a portion of any cell's genome is expressed at any given time. Differences in
expression profiles account for the different types of cells and tissues within an organism
and for a cell's varying response to stress or disease.

[0005] Thus, cells in different tissues in the human body are unique because they
have different active genes. For example, blood cells and muscle cells not only look
different, they also perform different functions. Blood cells supply oxygen to organs and
protect us from disease, while muscle cells enable us to move and digest food. These
differences are due to specific gene products that are unique to blood or muscle cells.
The presence of different proteins within the same cell is the result of the
function of different genes. An example could be the generation of antibody (Ab)
diversity or viruses.

[0006] Functional genomics is aimed at discovering the biological function of 

particular genes and uncovering the means by which sets of genes and their products 
work together in health and disease. According to the Human Gene Consortium group, 
there appear to be about 30,000 to 40,000 protein coding genes in the human genome. 
Amazingly, this is only about twice as many as in C. elegans or D. melanogaster, the fruit
fly. Thus the vast complexity of the human must be due to more complicated use of the
existing genes through alternative splicing rather than simply an increased number of genes.

[0007] If genes encode multiple proteins, then the architect of biological 

complexity distinguishing our genetic material from that of a worm is RNA, the molecule 
that directs the production of proteins from DNA. Unlike genes in bacteria, genes in plant 
and animal cells are not arranged as continuous DNA but as coding exons interspersed 
with noncoding introns making it possible to transcribe one gene into several different 
products as each mRNA is spliced together to form combinations of exons and bits and 
pieces of introns.

[0008] Previous estimates were that around 20% of human genes are transcribed

in more than one alternative variant, but recent research puts the number closer to 50% 
and even this estimate has been criticized as conservative. For example, a team of 
American researchers studying genes that control brain development in Drosophila 
melanogaster reviewed calculations indicating that the Neurexin genes can give rise to 
35,000 different possible protein products just from alternative splicing. If you add the 
possibilities for RNA editing as well as post translational modifications, one could 
potentially end up with millions of different gene products. In fact, studies of fly species
that have evolved separately for millions of years show that the sequences of many
alternative splice sites are strictly conserved, indicating that they are in fact used.

[0009] Thus the desired endpoint for the description of a biological system is not 

the analysis of mRNA transcript levels alone but also the accurate measurement of 
protein expression levels and their respective activities. Quantitative analysis of global 
mRNA levels is the current method for the analysis of the state of cells and tissues
(Fraser, et al., 1997 "Strategies for whole microbial genome sequencing and analysis"
Electrophoresis 18:1207-1216). Several methods have been refined to provide absolute
mRNA or relative mRNA levels in comparative analysis. mRNA based genomics,
however, has several inherent limitations. For example, gene (mRNA) expression
levels may not always accurately predict the protein expression levels. Therefore gene
expression analysis such as with microarrays may not provide definitive information on
certain targets. In fact, Gygi et al. recently concluded that the correlation for all yeast
proteins between mRNA and protein expression levels was less than 0.4. Indeed, for
some genes, while the mRNA levels were of the same value the protein levels varied by
more than 20-fold. Conversely, invariant steady-state levels of certain proteins were
observed with respective mRNA transcript levels that varied by as much as 30-fold. Gygi
et al. "Correlation between Protein and mRNA Abundance in Yeast" Molecular and
Cellular Biology March 1999 pp 1720-1730.
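
By way of illustration only, a correlation figure of the kind cited above can be computed as a Pearson correlation coefficient over paired mRNA and protein abundance measurements. The following Python sketch assumes that such paired measurements are available as two lists; the function name `pearson` and the example values are hypothetical and are not data from Gygi et al.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)

# Hypothetical values: similar mRNA levels paired with widely varying protein levels.
mrna_levels = [10.0, 10.0, 11.0, 12.0, 30.0]
protein_levels = [5.0, 100.0, 20.0, 8.0, 40.0]
r = pearson(mrna_levels, protein_levels)
```

A value of `r` near 1 would indicate that mRNA level predicts protein level well; values well below 1 correspond to the weak predictive power discussed above.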

[0010] Further, post translational modification of proteins such as proteolytic 

cleavage, glycosylation, phosphorylation, prenylation, myristoylation, ubiquitination and 
N- and C-terminal processing can affect protein activity and half life. These 
modifications cannot be determined solely from gene sequence or expression data. Some
proteins are active only when they are complexed to other molecules or proteins, or at a
particular sub-cellular location within a cell. Again, these factors cannot be determined
from gene sequence or expression data. FIG. 20 is a diagram demonstrating the layers of
information which may be assayed to identify the real state of a cell (furthest outward
circle). Those who assay DNA and raw sequence data determine gene function based on
sequence similarity, gene structure, and evolutionary relationships. Missing from this
data is any mRNA or translational modification data. Those who assay mRNA gain a
prediction of a protein profile based on the assumption that protein levels are directly
proportional to mRNA, an assumption which is proving to be erroneous. Closest of all
these methods to the real cell state is the method of the invention, which detects actual
cellular protein levels by direct measurement.

[0011] The field of study of proteomics has gained increasing importance as 

functional genomics attempts to assign functions to the mass of information from the 
human genome. Proteomics includes the science and processes of analyzing and 
cataloging all the proteins encoded by a genome (a proteome). 

[0012] Complete descriptions of proteins including sequence structure and 

function will substantially aid the current pharmaceutical approach to therapeutics 
development. Thus the specific structural and functional aspects of a particular protein 
can be used to design better proteins or small molecule ligands that can serve as 
activators or inhibitors of protein function to develop drugs. Genome sequence 
information, due to the multitude of steps between gene transcription and corresponding 
protein function, is often insufficient to explain disease mechanisms. 

[0013] Multiple genes may be involved in a single disease process. Identifying all 

the genes involved in a particular disease based on DNA sequence data may be possible,
but learning how these genes function in health and disease (and how therapeutic
interventions can be designed for them) requires proteomics.

[0014] Disease may be caused by changes in gene expression, protein expression, 

or post translational modification of proteins. Many proteins are the intermediate targets
for drugs, drug-related changes in gene expression levels, or an indirect result of the
drug's interaction with the protein. Cells and their proteomes are dynamic. One genome
may yield multiple proteomes as a result of changes in differentiation, stress, or disease
condition. Proteomics can be used to determine serum based biomarkers which can be
valuable as clinical markers or used as the basis of a diagnostic.

[0015] As can be seen, a need exists in the art for identifying proteins and their 

concomitant coding sequences that are directly or indirectly regulated and involved in 
differential expression patterns associated with disease states, different tissue types, or 
other alternative cell states. 

BRIEF SUMMARY OF THE INVENTION 

[0016] The present invention relates generally to methods and compositions for 

the identification of differential protein expression patterns and concomitantly the active 
genetic regions that are directly or indirectly involved in different cell types, tissue types, 
disease states, or other cellular differences desirable for diagnosis or for drug therapy 
targets. 

[0017] According to the invention a method for obtaining a protein profile in a 

cell is disclosed by use of a genetic integration polynucleotide encoding a tag protein 
which may be actively detected. The polynucleotide construct comprises a marker gene 
or tag which is introduced into the genome of an organism using any vector insertion 
method known in the art, developed in the future, or described herein. The marker gene is 
not operably linked to any promoter sequence in the construct (promoterless) and the 
construct thus relies upon integration within an active transcription unit within the cell for 
expression. The activity of the tag is then measured to sort and preferably quantify 
protein expression patterns for the cell. Once a profile expression pattern is obtained,
molecular biology techniques are employed to ascertain the particular genetic loci which
are expressed. This information elucidates diagnostic profiles for disease or other cellular
states or types as well as elucidating potential target sites for drug intervention and 
alternative gene forms (SNPs). FIG. 3 depicts a general overview of the process of the 
invention as applied to a cancer versus a normal cell. 

[0018] Polynucleotides for achieving the methods of the invention are disclosed 

including expression constructs, molecular biology techniques, transformed cells, vectors,
and methods of design of the same which are intended to be within the scope of the
invention. 

[0019] It is an object of the present invention to provide an immediate linking of 

protein information to its corresponding genome sequence to provide information for 
diagnostic protocols, pathway elucidation, or targets for drug design. 

[0020] It is another object of the present invention to identify a protein expression 

profile of any particular cell, whether plant, bacterial, animal, etc. in origin, and to 
quantify relative levels of expression of those proteins associated with a particular 
population. 

[0021] It is yet another object of the present invention to provide a library of 

functional genomic data that may be used to develop human therapeutics. Most 
researchers involved in the field of functional genomics rely on machine-based analysis 
for protein structure or function to assign functions to proteins. Those approaching the 
task from a sequencing objective assign function by analyzing and comparing genomic 
data developed by comparisons between disease and normal tissues. Typically this is 
accomplished through the use of gene chips or direct sequencing. 

[0022] It is yet another object of the invention to provide an immediate link 

between genomic sequences to proteomic information using molecular biology 
techniques. Results of the information according to the invention can provide new 
therapeutic target development. Individual variations in protein expression levels between 
normal and aberrant tissues will lead to the direct identification of new therapeutic 
targets. If an unidentified protein is either higher or lower in expression levels within the 
malignant cells compared to the normal cells, it provides a probable target for further 
study and identifies a potential drug intervention site. Unique protein targets will be 
identified according to the invention. 
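
As a non-limiting illustration of the comparison described in this paragraph, the following Python sketch flags tagged proteins whose measured marker levels are higher or lower in malignant cells than in normal cells. The tag identifiers, the example values, and the two-fold threshold `min_fold` are hypothetical assumptions chosen only for the example; the specification does not prescribe a particular threshold.

```python
def flag_candidates(normal, malignant, min_fold=2.0):
    """Return {tag: fold_change} for tags changed at least min_fold in either direction.

    normal, malignant: dicts mapping a tag identifier to a positive expression level.
    fold_change is malignant / normal, so values below 1 mean lower in malignant cells.
    """
    flagged = {}
    for tag in normal.keys() & malignant.keys():
        n, m = normal[tag], malignant[tag]
        if n <= 0 or m <= 0:
            continue  # a ratio is not meaningful without two positive measurements
        fold = m / n
        if fold >= min_fold or fold <= 1.0 / min_fold:
            flagged[tag] = fold
    return flagged

# Hypothetical marker intensities for three tagged loci.
normal_cells = {"tagA": 100.0, "tagB": 100.0, "tagC": 100.0}
malignant_cells = {"tagA": 400.0, "tagB": 110.0, "tagC": 25.0}
candidates = flag_candidates(normal_cells, malignant_cells)
```

In this example tagA (higher in malignant cells) and tagC (lower in malignant cells) would be flagged for further study, while tagB would not.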

[0023] It is yet another object of the invention to provide information about entire 

pathways of protein regulation, providing new target development when multiple proteins
are involved in a particular state. Most cancer and other therapeutic drug development
focuses on a single protein target at one time. However, complex interactions between
proteins result in the malignant or disease state in almost all cells. According to the
invention, applicants' method identifies protein expression levels for an entire pathway of
active protein targets and can evaluate expression of multiple proteins simultaneously,
providing for analysis of patterns of protein co-expression in malignant or disease state
cells as compared to normal cells.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0024] FIG. 1 is a schematic of a vector useful for the invention. In this example, 

integration of a marker peptide coding sequence can occur either in an intron or exon in 
split genes encoding protein products (inclusive but not limited, e.g. genes without 
introns that encode proteins such as histones etc., or genes encoding physiologically 
active RNAs, e.g., snRNA, scRNA, spliceosome components etc.). For the sake of 
clarity, integration into an intron sequence of a cellular gene encoding a protein is shown. 
Placement of a splicing acceptor (SA) upstream of a marker peptide-encoding sequence 
results in the synthesis of a mRNA encoding a fusion protein that includes the marker 
peptide fused to peptide sequences encoded by upstream exons (this occurs when the splice
donor of the nearest upstream exon (closer to the start of transcription) is joined to the
splice acceptor present in the integrated marker DNA sequence).

[0025] FIGS. 2A-2K depict diagrams of several variant constructions of 

retroviral vectors which perform certain distinct functions for acquiring different types of 
information in cells. The critical portion is the area located between the 5' and 3' LTR.
These expression cassettes would be moved essentially intact between any of the various
viruses and/or plasmids that we have mentioned. FIG. 2A is a vector for exon acquisition.
FIG. 2B depicts a vector designed for integration site acquisition. FIG. 2C depicts a
vector for incorporation of multiple marker genes. FIG. 2D depicts a transfection
cassette. FIG. 2E depicts a vector for replication competent virus. FIG. 2F depicts a
vector for a fusion protein marker for cell pre-separation and FACS analysis. RE, Type
IIS restriction enzyme site; LTR, long terminal repeat; CMV IE, CMV immediate
early promoter; NeoR, neomycin resistance gene; pA, bovine growth hormone poly-A
signal; SA, human γ-globin intron #2 splicing acceptor. pA, NeoR, CMV, hrGFP, SA
are in anti-sense orientation against LTRs. Gag, pol, env, retroviral helper virus. FIG. 2H
depicts examples of HIV-1 based vectors that do not have selection markers and have the
exon trapping marker hrGFP or AcGFP flanked by either a splice acceptor site only
(pHSG) or by both splice acceptor and donor signals (pHSGEX, pHS2GEX3,
pHS2AcGEX3 and pHS3GEX4) which are respectively flanked by flexible linker/s of
glycine and serines (GC linkers) on the amino terminus only (pHSGEX) or on both
amino and carboxyl terminus of the markers hrGFP (pHS2GEX3) or AcGFP
(pHS2AcGEX3, pHS3AcGEX4). Vector pHS3AcGEX4 contains an ampicillin
resistance gene and a pUC bacterial origin of replication, which allow rescue of
the genomic sequence tags flanking the 3' LTR chromosomal insertion site by digestion
with RsaI and self-ligation of chromosomal DNA followed by transformation into
bacteria and selection for Amp resistance. FIG. 2I shows HIV-1 based vectors that, in
addition to the gene trapping exon markers hrGFP or AcGFP flanked by GC linkers,
contain an expression cassette for selection of transduced cells. The expression cassettes
consist of either a PGK promoter or an adenovirus E1b promoter driving the expression of
either murine α(1,3) galactosyl transferase (αGal) or neomycin phosphotransferase (Neo)
genes. Expression of αGal results in the expression of the α-galactosyl epitopes on
glycoproteins and glycolipids present on the external cell surface, which allows
magnetic selection of cells showing expression of this gene by using antibodies against
the α-galactosyl residue, which is normally absent in untransduced human cells. The αGal
and Neo genes are cloned in a poly-A trapping configuration to select for those insertion
events that occur within transcriptional units. Vectors pHS3AcGEX4PA,
pHS3AcGEX4E1bA and pHS3AcGEX4E1bNeo contain an ampicillin resistance gene
and a pUC bacterial origin of replication that allow rescuing the genomic insertion sites
flanking the 3' SIN LTR as described above. FIG. 2J shows four vectors that contain a
gene trapping exon encoding three copies of the influenza haemagglutinin HA epitope.
Vectors pHS2HA3Xf0, pHS2HA3Xf1 and pHS2HA3Xf2 encode three copies of the HA
epitope in translational frames 0, 1 and 2, respectively. Vector pHS2HA3F encodes three
copies of the HA epitope, each one in a different translational frame. This allows for 
magnetic selection of cells showing gene trapping events that result in membrane proteins 
displaying the HA epitope in the external surface of the cell membrane. FIG. 2K depicts 
examples of vectors based on MoMLV backbones where the translation frame of the 
fluorescent marker protein is shifted by zero, one or two nucleotides inserted after the
splice acceptor signal. More structural features of these and other vectors are described in
Table 1. 

[0026] FIG. 3 delivers a rudimentary overview of the process of the invention. 

The process begins with at least two different populations of cells to be compared. Each 
population of cells to be compared will have been marked genetically by a vector 
containing marker/s-peptides to facilitate detection and determination of relative 
concentration of marker/s. Left portion of middle panel demonstrates separation of 
populations of cells based on relative amount of marker present in the tagged cells. 
Sequences flanking the vector will be determined by, but not limited to, 5' Serial Analysis
of Viral Integration (5' SAVI), Serial Analysis of Viral Integration (SAVI) or an inverse
PCR procedure for recovering genomic tags associated with vector or viral integration
events. Valid tags will then be compared to public and commercial databases and
annotated into our own databases.

[0027] FIG. 4 is a depiction of a gene trap vector, pGT5A, with a humanized
Renilla fluorescent protein (hrGFP) as an assay marker, or reporter gene. (A) Schematic
diagram of pGT5A plasmid. LTR, long terminal repeat; PBS, retroviral primer binding
site; CMV IE, CMV immediate early promoter; NeoR, neomycin resistance gene; pA,
bovine growth hormone poly-A signal; SA, human γ-globin intron #2 splicing acceptor;
AmpR, ampicillin resistance gene for bacterial cloning. pA, NeoR, CMV, hrGFP, SA are
in anti-sense orientation against LTRs. (B) Schematic order of genes in pGT5A vector.

[0028] FIG. 5 is a depiction of a vector, pGT5AH, with a humanized Renilla
fluorescent protein (hrGFP) as an assay marker, or reporter gene. (A) Schematic
diagram of pGT5AH plasmid. LTR, long terminal repeat; PBS, retroviral primer binding
site; CMV IE, CMV immediate early promoter; NeoR, neomycin resistance gene; pA,
bovine growth hormone poly-A signal; SA, human γ-globin intron #2 splicing acceptor;
AmpR, ampicillin resistance gene for bacterial cloning. pA, NeoR, CMV, hrGFP, SA are
in anti-sense orientation against LTRs. The His6 tag contains 6 contiguous histidine
residues at the C-terminus of hrGFP for detection by anti-His6 antibody. (B) Schematic
order of genes in pGT5AH vector.

[0029] FIG. 6 is a depiction of pGT5Z with a humanized Renilla fluorescent
protein (hrGFP) as an assay marker, or reporter gene, and a Zeocin resistance gene (ZeoR).
(A) Schematic diagram of pGT5Z plasmid. LTR, long terminal repeat; PBS, retroviral
primer binding site; CMV IE, CMV immediate early promoter; NeoR, neomycin
resistance gene; pA, bovine growth hormone poly-A signal; SA, human γ-globin intron #2
splicing acceptor; SD, synthetic splicing donor; SV40, simian virus type 40 early
promoter; AmpR, ampicillin resistance gene for bacterial cloning. pA, NeoR, CMV,
hrGFP, SA are in anti-sense orientation against LTRs. (B) Schematic order of genes in
pGT5Z vector.

[0030] FIG. 7 is a depiction of a demonstration of the splicing function and fusion 

hrGFP protein expressed by pGT5A vector. (A) A construct of pGT5Z, which is derived
from pGT5A by insertion of an SV40 early promoter (SV40), a Zeocin resistance gene
(ZeoR), and a synthetic splicing donor and partial intron to demonstrate the expected
biological functions of pGT5A after gene trapping. (B) pGT5Z-transfected cells after
Zeocin selection showed significant Zeocin-hrGFP fusion protein expression by FACS
analysis.

[0031] FIG. 8 is a depiction of gene trapping in pGT5A-transfected PA317
cells. (A) PA317 cells transfected with pGT5A showed a 3.6% hrGFP-positive cell
population. (B) After sorting of the hrGFP-positive cell population in (A) by FACS cell
sorter, the hrGFP-positive population was enriched to 95% after 2 weeks of cell culture.

[0032] FIG. 9 is a depiction of gene expression of hrGFP in gene trapped PA317
cells. RT-PCR was performed on total RNA extracted from sorted cells in FIG. 7 and
FIG. 8, and the PCR product was electrophoresed in a 2% agarose gel. The whole length of
hrGFP transcripts driven by the trapped cellular promoter (GT5A/PA317) was amplified by
hrGFP specific primers after cDNA synthesis, as indicated with an arrow. Transcripts
from GT5Z in PA317 (GT5Z/PA317) and PA317 without vector (PA317) were used as
positive and negative controls, respectively.

[0033] FIG. 10 is a depiction of gene trapping of GT5A vector in human lung 

cancer cells, A549, after viral transduction. (A) A549 cells without transduction analyzed
by FACS. (B) A549 cells with GT5A transduction analyzed by FACS showed that the
hrGFP-positive population is 1.68% after gene trapping.

[0034] FIG. 11 is a depiction of gene trapping of GT5A vector in NIH3T3 cells.
A mixed population of GT5A-trapped NIH3T3 cells was sorted and cultured for three
weeks and then analyzed by FACS in comparison to untransduced cells. Different
intensities of hrGFP were shown in four different major groups.

[0035] FIG. 12 is a depiction of hrGFP gene expression of single-cell clones from 

GT5A-trapped NIH3T3 cells. Individual single cells were sorted into a 96-well plate and
cultured to a sufficient population for FACS analysis. A6P1 and C4P2, C8P2 and H8P2
were analyzed at two different events and compared to untransduced NIH3T3 cells.

[0036] FIG. 13A-D is a depiction of gene trapping with an α1,3-galactosyl
transferase as a reporter gene in the human melanoma cell line A375. (13A) Schematic
diagram of serial gene trapping vectors with the α1,3-galactosyl transferase (α1,3-gal) gene.
α1,3-galactosyl transferase is an enzyme that will generate galactosylated products only in
the Golgi; by using its gene as a marker for gene trapping, it will select for those gene
trapping events that generate fusions between α1,3-galactosyl transferase and proteins
that have a Golgi localization signal. LTR, long terminal repeat; SV40, simian virus
type 40 early promoter; ZeoR, Zeocin resistance gene; CMV, CMV early promoter; NeoR,
neomycin resistance gene; pA, bovine growth hormone poly-A signal; SA, human γ-globin
intron 2 splicing acceptor; SD, synthetic splicing donor. pA, NeoR, CMV, α1,3-gal, SA
or SD, ZeoR and SV40 are in anti-sense orientation against LTRs. (13B) Gene trapping
of pGT7A in A375/AMIZ cells. Cells were labeled with lectin conjugated with FITC for
FACS analysis. Lectin binds to α1,3-gal epitopes on the cell surface to show successful
gene trapping. (13C) Gene trapping in A375/AMIZ cells 3 days post transfection of
pGT7AH. (13D) Splicing function and functional α1,3-gal/ZeoR fusion protein were
demonstrated by lectin/FITC-positive cells.

[0037] FIGS. 14A-14D show representative examples of cell sorting by FACS
into different gates according to expression levels of the fluorescent reporter gene. In
these examples, a normal (HMEC, FIGS. 14A and 14C) and a cancer (MCF7, FIGS. 14B
and 14D) breast cell line were transduced with retroviral vectors HSG (FIGS. 14C and
14D) or pGT-FSO (FIGS. 14A and 14B). Cells were first enriched by a first round of cell
sorting by FACS (second panel from the top in every FIG.) and then were subsequently
sorted into 4 gates with different levels of expression of the fluorescent marker protein.
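
The gating step can be pictured, purely as a sketch, as binning each cell by its measured fluorescence intensity. In the Python example below, the three boundary values that separate the 4 gates, the cell identifiers and the intensities are all hypothetical; actual gate positions would be set by the operator on the FACS instrument.

```python
from bisect import bisect_right
from collections import defaultdict

def assign_gate(intensity, boundaries):
    """Return the 1-based gate number for one cell.

    boundaries: ascending intensities separating adjacent gates
    (3 boundaries define 4 gates). A cell exactly on a boundary
    falls into the higher gate.
    """
    return bisect_right(boundaries, intensity) + 1

def sort_into_gates(cells, boundaries):
    """Group cell identifiers by gate number."""
    gates = defaultdict(list)
    for cell_id, intensity in cells.items():
        gates[assign_gate(intensity, boundaries)].append(cell_id)
    return dict(gates)

# Hypothetical gate boundaries and fluorescence readings (arbitrary units).
boundaries = [10.0, 100.0, 1000.0]
cells = {"cell1": 3.0, "cell2": 50.0, "cell3": 700.0, "cell4": 5000.0, "cell5": 80.0}
gated = sort_into_gates(cells, boundaries)
```

Each resulting group corresponds to one sorted population whose integration sites can then be analyzed separately.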

[0038] FIG. 15 is a schematic depicting a vector of the invention which utilizes 

homologous recombination as the integration strategy. The repeat sequences are 
engineered to flank the assay marker gene and then introduced to the cell. 

[0039] FIG. 16 is a diagram depicting the concept of frame alignment. Only 1 in 

3 integrants will be in frame, based upon the triplet codon scheme so that only 1 in three 
integrated vectors will be functional and result in translation of the assay marker. 
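
The 1-in-3 expectation follows from simple modular arithmetic, sketched below in Python under simplifying assumptions: the number of coding nucleotides upstream of the splice junction is taken as known, any vector-encoded shift is a fixed 0, 1 or 2 nucleotides, and stop codons are ignored. The function names and the example lengths are hypothetical.

```python
def marker_in_frame(upstream_coding_nt, vector_shift=0):
    """True if the marker codons continue the reading frame of the upstream exons.

    upstream_coding_nt: coding nucleotides between the start codon and the junction.
    vector_shift: extra nucleotides (0, 1 or 2) placed after the splice acceptor.
    """
    return (upstream_coding_nt + vector_shift) % 3 == 0

def in_frame_fraction(upstream_lengths, vector_shift=0):
    """Fraction of integration events that leave the marker in frame."""
    hits = sum(marker_in_frame(n, vector_shift) for n in upstream_lengths)
    return hits / len(upstream_lengths)

# Hypothetical junction positions spread evenly over the three codon phases.
lengths = list(range(300, 330))
fraction = in_frame_fraction(lengths)
```

Under these assumptions one third of the events are in frame for any single vector, and for every junction exactly one of the three shifts (0, 1 or 2) restores the frame, which is the rationale for using a set of three frame-shifted vectors.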

[0040] FIG. 17 is a schematic depicting the inverse PCR procedure for recovering 

genomic tags associated with vector or viral integration events. The cellular DNA is
cleaved such that the inserted DNA (with sequence known to the operator) is cleaved
once and the flanking cellular DNA of unknown sequence is cleaved again in the regions
contiguous to the inserted piece of DNA. Cleavage of the DNA occurs in a fashion
generating ends that permit the circularization of DNA fragments, producing a molecule
with the sequence known to the operator flanking both sides of, and continuous with, a
variable length of cellular DNA of unknown sequence. The region containing the
unknown DNA is then amplified and sequenced.
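
The logic of this procedure can be modeled on plain sequence strings, as in the following simplified Python sketch. It assumes a single-stranded text representation, a recognition site that occurs once in the insert and again in the downstream flank, and cleavage at the first base of the site; the function name, the example sequences and the choice of GAATTC as the site are hypothetical and serve only to illustrate the principle.

```python
def recover_flank(genome, insert, site):
    """Return the unknown cellular sequence lying between the insert and the
    nearest downstream recognition site, or None if it cannot be recovered."""
    start = genome.find(insert)
    if start < 0:
        return None
    end = start + len(insert)
    cut_known = genome.find(site, start, end)  # cut inside the known insert
    cut_flank = genome.find(site, end)         # nearest cut in the downstream flank
    if cut_known < 0 or cut_flank < 0:
        return None
    # Linear fragment released by the two cuts: known insert part + unknown flank.
    fragment = genome[cut_known:cut_flank]
    known_part = genome[cut_known:end]
    # After circularization the far end of the flank is joined back to the start
    # of known_part, so known sequence borders the flank on both sides and
    # outward-facing primers in known_part amplify across the flank.
    return fragment[len(known_part):]

# Hypothetical sequences.
insert = "ACGTGAATTCACGTAC"
genome = "CCCC" + insert + "TTAGGCAT" + "GAATTC" + "AAAA"
flank = recover_flank(genome, insert, "GAATTC")
```

Here the recovered `flank` is the stretch of cellular DNA immediately downstream of the insert, which is the genomic tag that would be sequenced.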

[0041] FIG. 18A-B is a schematic depicting one of the possible experimental 

procedures to carry out the method denominated 5' Serial Analysis of Viral Integration
(5' SAVI). After the SA joins to the splicing donor (SD) of the integrated cellular gene
by the cellular splicing mechanism, reverse transcription is employed to convert this hybrid
RNA transcript into a complementary double stranded cDNA (cDNA) following standard
methods for full length total cDNA synthesis. This cDNA is then subjected to restriction 
enzyme digestion with a Type IIS restriction enzyme which will cut the cDNA into the 
sequences corresponding to the cellular exon ten to twenty bases away from the SD/SA 
junction depending on which Type IIS restriction enzyme is used. A biotin-labeled 
primer #1 with a sequence specific for the marker gene is then employed to generate a 
ssDNA fragment that extends into the cellular exon fused to the marker gene. Collection 
of this biotin-ssDNA by streptavidin conjugated magnetic beads enriches these specific
ssDNA for subsequent DNA terminal transferase reaction. Poly-deoxynucleotide can be
added onto these ssDNA as a tail at their 3' end. An oligonucleotide primer
complementary to the polymer tail and a second primer #2 nested with respect to primer
#1 on the marker gene can therefore be used to amplify by PCR this 3' end of the cellular
exon fused to the 5' side of the marker exon. These short tags or amplification fragments
from different integrated genes can, by ligation reactions, be made into longer DNA 
fragments that are subsequently sequenced. 
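
The tag produced by this procedure can be illustrated in silico as the short stretch of cellular exon lying immediately upstream of the SD/SA junction in the fusion cDNA. The Python sketch below assumes the cDNA and the first bases of the marker exon are available as strings; the function name, the sequences and the tag length of 10 bases are hypothetical (the text above gives a range of ten to twenty bases depending on the Type IIS enzyme used).

```python
def extract_5prime_tag(cdna, marker_start, tag_len=14):
    """Return up to tag_len bases of cellular exon directly upstream of the
    junction with the marker exon, or None if the marker is not present."""
    junction = cdna.find(marker_start)
    if junction < 0:
        return None
    return cdna[max(0, junction - tag_len):junction]

# Hypothetical fusion cDNA: cellular exon followed by the start of the marker exon.
cellular_exon = "AAACCCGGGTTTACGT"
marker_start = "ATGGTGAGC"
fusion_cdna = cellular_exon + marker_start
tag = extract_5prime_tag(fusion_cdna, marker_start, tag_len=10)
```

The resulting `tag` is the sequence that would be matched against genomic databases to identify the trapped gene.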

[0042] FIG. 18C-D is a schematic depicting another possible experimental
approach to carry out the 5' SAVI method. In this second version of the 5' SAVI method,
a biotinylated primer #1 specific to the complementary sequence of the marker exon is 
used to prime cDNA synthesis by reverse transcriptase. Next, a polynucleotide tail is 
added to this single stranded cDNA by the enzyme terminal transferase. An 
oligonucleotide primer complementary to this homopolymeric tail is then used to drive 
the synthesis of the complementary second DNA strand. Double stranded DNA products 
are subjected to a type IIS restriction enzyme digestion and the digestion products are 
purified with magnetic streptavidin beads. An adaptor is ligated to the end generated by 
the type IIS restriction enzyme and the products are amplified by PCR with a primer 
corresponding to the adaptor sequence and with a primer #2 specific to the marker gene 
and nested with respect to primer #1. Amplification products are ligated together into 
high order polymeric structures, cloned into sequencing vectors and sequenced. 

[0043] FIG. 18E-F depicts the method of 3' SAVI. In this case, the marker exon
contains a type IIS restriction enzyme site at the 3' end of the exon sequence immediately
followed by a splice donor consensus sequence. After mRNA purification, double
stranded cDNA synthesis and digestion with a type IIS restriction enzyme, a double
stranded DNA adaptor can be ligated to the end generated by the type IIS restriction
enzyme. Amplification of fragments containing marker sequences can be accomplished
by PCR amplification using a primer #1 specific to the marker exon sequence and a
second primer corresponding to the ligated adaptor. After PCR amplification, fragments
of equal length can be ligated into high order polymeric structures, cloned into
sequencing vectors and sequenced.

[0044] FIG. 18G-H is a schematic of the Serial Analysis of Viral Integration

(SAVI) method that permits the simultaneous identification of cellular exon boundaries. 
The method starts by isolating mRNA from the cell, and generating full length double 
stranded cDNA by reverse transcription. The cDNA is subjected to a Type IIS restriction 
enzyme (RE) that recognizes the first and second Type IIS RER sites, cleaving the cDNA 
upstream of the first Type IIS RER site and downstream of the second Type IIS RER site. 
Then, the fragment is self-ligated generating a circular molecule, where the sequence tags 
from the upstream and downstream cellular exons are fused in inverse orientation 
generating a di-tag. This di-tag is then amplified by inverse PCR using marker-specific 
primers. Following amplification, the fragments are digested with the 
restriction enzymes that cut at the non-Type IIS RE sites, separated, and ligated together to 
form a concatemer, which is then sequenced by appropriate methods. 
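
By way of non-limiting illustration, the excision and self-ligation steps described above can be sketched in Python. The recognition sequence, the cut distance (`reach`), and the input sequences are hypothetical placeholders, not values taken from this specification. The sketch models only one strand and ignores overhang geometry.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def savi_ditag(cdna, site, reach):
    """Model type IIS excision and circularization on the sense strand.

    cdna  : sense-strand cDNA containing a marker flanked by two sites
    site  : recognition sequence (hypothetical placeholder)
    reach : bases between the recognition site and the cut (assumed)
    """
    # The first site is assumed to face upstream, so it appears as the
    # reverse complement on the sense strand; the second faces downstream.
    first = cdna.find(revcomp(site))
    second = cdna.rfind(site)
    if first < 0 or second < 0 or second <= first:
        raise ValueError("both recognition sites must be present, in order")
    start = first - reach
    end = second + len(site) + reach
    if start < 0 or end > len(cdna):
        raise ValueError("cut position falls outside the cDNA")
    up_tag = cdna[start:first]
    down_tag = cdna[second + len(site):end]
    # Self-ligation joins the downstream end to the upstream end, so the
    # junction spanned by inverse PCR is down_tag followed by up_tag.
    return up_tag, down_tag, down_tag + up_tag


# Hypothetical example sequences, chosen only to exercise the function.
upstream_exon = "ACGTACGTAC"
marker = "GTCCC" + "ATGGCC" + "GGGAC"
downstream_exon = "TGCATGCATG"
print(savi_ditag(upstream_exon + marker + downstream_exon, "GGGAC", 4))
```

In this simplified model the length of each tag equals the assumed cut distance, which is why fragments produced by a given type IIS enzyme are of equal length.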

[0045] FIG. 19 is a non-limiting flow diagram demonstrating the entire process. 

This figure provides a rudimentary overview of the process of the invention. The process 
begins with two different populations of cells to be compared. Each population of cells to 
be compared will have been marked genetically by a vector containing marker/s-peptides 
to facilitate detection and determination of the relative concentration of marker/s. The left 
portion of the middle panel demonstrates separation of populations of cells based on the relative 
amount of marker present in the tagged cells. Sequences flanking the vector will be 
determined by, but not limited to, the SAVI method or an inverse PCR procedure for 
recovering genomic tags associated with vector or viral integration events. Valid tags 
will then be compared to public and commercial databases and annotated into our own 
databases. As can be seen, alternatives exist at each step. 

[0046] FIG. 20 is a diagram demonstrating the layers of information which may 

be assayed to identify the real state of the cell (outermost circle). Those who assay 
DNA and raw sequence data determine gene function based on sequence similarity, gene 
structure, and evolutionary relationships. Missing from these data are any mRNA or 
translational modification data. Those who assay mRNA gain a prediction of a protein 
profile based on the assumption that protein levels are directly proportional to mRNA levels, an 
assumption which is proving to be erroneous. Closest of all these methods to the real cell 
state is the method of the invention, which detects actual cellular protein levels by direct 
measurement. 

[0047] FIG. 21 is a depiction of a successful gene trapping in pGT5A-transfected 

PA317 cells. An NcoI restriction site located at the 5' end of the hrGFP marker gene and an 
EcoRI site at the oligo-dA primer were used as cloning sites for the gene-trapped sequence into a 
sequencing vector which was digested with NcoI and EcoRI. After BLAST searching 
against the mouse EST database in GenBank, the sequence trapped by pGT5A demonstrates 
99% homology to a high mobility group protein, HMGI-C, a nuclear phosphoprotein that 
contains three short DNA-binding domains (AT-hooks) and a highly acidic C-terminus. 

[0048] FIG. 22 is a depiction of gene trapping of an exon with unknown 

biological function in pGT5A-transfected PA317 cells. An NcoI restriction site located at the 
5' end of the hrGFP marker gene and an EcoRI site at the oligo-dA primer were used as cloning 
sites for the gene-trapped sequence into a sequencing vector which was digested with NcoI 
and EcoRI. After BLAST searching against the EST database in GenBank, the sequence 
trapped by pGT5A is a 95% match to NCI_CGAP_Li9 Mus musculus cDNA clones 
BF539247.1/BF533319.1, which have been found in cDNA libraries from salivary 
gland and liver. 

[0049] FIG. 23 is a compilation of genes identified by this technology, classified 

according to their subcellular localization. 

[0050] FIG. 24 is a compilation of genes identified by this technology, classified 

according to their functional role. 

[0051] FIG. 25A-B illustrates the use of this invention to screen for cellular 

protein domains that interact with a target protein of interest. As shown in FIG. 25A, a 
stable cell line expressing a fusion of the target protein (bait protein) to one 
of the subunits of the reporter system is constructed. Gene trapping is performed on this 
stable cell line using retroviral vectors that encode the second subunit of the reporter 
system in an exon acceptor configuration. As shown in FIG. 25B, this generates a library 
of gene-trapped protein domains fused to the second subunit of the reporter system. Upon 
interaction of the protein domains fused to each subunit of the reporter system, the 
reporter system gains functionality, allowing it to generate a fluorescent signal either by 
fluorescence resonance energy transfer (FRET, for the CFP/YFP system) or by enzymatic 
conversion of a pro-fluorescent substrate (α/ω beta-lactamase split system, PNAS 2002, 
99:3469; Nat. Biotech. 2002, 20:619). 

DETAILED DESCRIPTION OF THE INVENTION 

Definitions 

[0052] Unless defined otherwise, all technical and scientific terms used herein 

have the same meaning as commonly understood by one of ordinary skill in the art to 
which this invention belongs. Generally, the nomenclature used herein and the laboratory 
procedures in cell culture, molecular genetics, and nucleic acid chemistry and 
hybridization described below are those well known and commonly employed in the art. 
Standard techniques are used for recombinant nucleic acid methods, polynucleotide 
synthesis, and microbial culture and transformation (e.g., electroporation, lipofection). 
Generally, enzymatic reactions and purification steps are performed according to the 
manufacturer's specifications. The techniques and procedures are generally performed 
according to conventional methods in the art and various general references (see 
generally, Sambrook et al. Molecular Cloning: A Laboratory Manual, 2d ed. (1989) Cold 
Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., which is incorporated herein 
by reference) which are provided throughout this document. Units, prefixes, and symbols 
may be denoted in their SI accepted form. Unless otherwise indicated, nucleic acids are 
written left to right in 5' to 3' orientation; amino acid sequences are written left to right in 
amino to carboxyl orientation, respectively. Numeric ranges are inclusive of the numbers 
defining the range and include each integer within the defined range. Amino acids may 
be referred to herein by either their commonly known three letter symbols or by the one- 
letter symbols recommended by the IUPAC-IUB Biochemical Nomenclature 
Commission. Nucleotides, likewise, may be referred to by their commonly accepted 
single-letter codes. Unless otherwise provided for, software, electrical, and electronics 
terms as used herein are as defined in The New IEEE Standard Dictionary of Electrical 
and Electronics Terms (5th edition, 1993). As employed throughout the disclosure, the 
following terms, unless otherwise indicated, shall be understood to have the following 
meanings and are more fully defined by reference to the specification as a whole: 

[0053] By "amplified" is meant the construction of multiple copies of a nucleic 

acid sequence or multiple copies complementary to the nucleic acid sequence using at 
least one of the nucleic acid sequences as a template. Amplification systems include the 
polymerase chain reaction (PCR) system, ligase chain reaction (LCR) system, nucleic 
acid sequence based amplification (NASBA, Cangene, Mississauga, Ontario), Q-Beta 
Replicase systems, transcription-based amplification system (TAS), and strand 
displacement amplification (SDA). See, e.g., Diagnostic Molecular Microbiology: 
Principles and Applications, D. H. Persing et al., Ed., American Society for 
Microbiology, Washington, D.C. (1993). The product of amplification is termed an 
amplicon. 

[0054] The term "antibody" includes reference to antigen binding forms of 

antibodies (e.g., Fab, F(ab)2). The term "antibody" frequently refers to a polypeptide 
substantially encoded by an immunoglobulin gene or immunoglobulin genes, or 
fragments thereof which specifically bind and recognize an analyte (antigen). However, 
while various antibody fragments can be defined in terms of the digestion of an intact 
antibody, one of skill will appreciate that such fragments may be synthesized de novo, 
either chemically or by utilizing recombinant DNA methodology. Thus, the term 
antibody, as used herein, also includes antibody fragments such as single chain Fv, 
chimeric antibodies (i.e., comprising constant and variable regions from different 
species), humanized antibodies (i.e., comprising a complementarity determining region 
(CDR) from a non-human source) and heteroconjugate antibodies (e.g., bi-specific 
antibodies). 

[0055] The term "assay marker" or "reporter" refers to a gene product that can be 

detected in experimental assay protocol, such as marker enzymes, antigens, amino acid 
sequence markers, cellular phenotypic markers, nucleic acid sequence markers, and the 
like. 

[0056] The term "assaying for the expression" of a protein coding sequence 

means any test or series of tests that permits cells expressing the protein to be 
distinguished from those that do not express the protein. Such tests include biochemical 
and biological tests and use either "selectable markers" or "assay markers." 






[0057] As used herein, "chromosomal region" includes reference to a length of a 

chromosome that may be measured by reference to the linear segment of DNA that it 
comprises. The chromosomal region can be defined by reference to two unique DNA 
sequences, i.e., markers. 

[0058] A "cloning vector" is a DNA molecule such as a plasmid, cosmid, or 

bacterial phage that has the capability of replicating autonomously in a host cell. Cloning 
vectors typically contain one or a small number of restriction endonuclease recognition 
sites at which foreign DNA sequences can be inserted in a determinable fashion without 
loss of essential biological function of the vector, as well as a selectable marker gene that 
is suitable for use in the identification and selection of cells transformed with the cloning 
vector. Selectable marker genes typically include genes that provide tetracycline 
resistance or ampicillin resistance. 

[0059] The term "conservatively modified variants" applies to both amino acid 

and nucleic acid sequences. With respect to particular nucleic acid sequences, 
conservatively modified variants refer to those nucleic acids which encode identical or 
conservatively modified variants of the amino acid sequences. Because of the degeneracy 
of the genetic code, a large number of functionally identical nucleic acids encode any 
given protein. For instance, the codons GCA, GCC, GCG and GCU all encode the amino 
acid alanine. Thus, at every position where an alanine is specified by a codon, the codon 
can be altered to any of the corresponding codons described without altering the encoded 
polypeptide. Such nucleic acid variations are "silent variations" and represent one species 
of conservatively modified variation. Every nucleic acid sequence herein that encodes a 
polypeptide also, by reference to the genetic code, describes every possible silent 
variation of the nucleic acid. One of ordinary skill will recognize that each codon in a 
nucleic acid (except AUG, which is ordinarily the only codon for methionine; and UGG, 
which is ordinarily the only codon for tryptophan) can be modified to yield a functionally 
identical molecule. Accordingly, each silent variation of a nucleic acid which encodes a 
polypeptide of the present invention is implicit in each described polypeptide sequence 
and is within the scope of the present invention. 
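
By way of non-limiting illustration, the notion of silent variation can be sketched in Python using the standard genetic code. The function names and the example mRNA are hypothetical. The sketch enumerates the synonymous codons for a given codon and counts how many distinct nucleic acid sequences encode the same polypeptide.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with bases U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonyms(codon):
    """All codons (sorted) that encode the same residue as `codon`."""
    residue = CODON_TABLE[codon]
    return sorted(c for c, aa in CODON_TABLE.items() if aa == residue)


def count_silent_variants(mrna):
    """Number of coding sequences, including `mrna`, with the same translation."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - len(mrna) % 3, 3)]
    return prod(len(synonyms(c)) for c in codons)


print(synonyms("GCU"))                      # the four alanine codons
print(synonyms("AUG"), synonyms("UGG"))     # single-codon residues
print(count_silent_variants("AUGGCUUGG"))   # Met-Ala-Trp (hypothetical)
```

For the hypothetical Met-Ala-Trp sequence only the alanine codon can vary, so four sequences encode the same peptide, consistent with the exception noted above for AUG and UGG.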

[0060] As to amino acid sequences, one of skill will recognize that individual 

substitutions, deletions or additions to a nucleic acid, peptide, polypeptide, or protein 
sequence which alters, adds or deletes a single amino acid or a small percentage of amino 
acids in the encoded sequence is a "conservatively modified variant" where the alteration 
results in the substitution of an amino acid with a chemically similar amino acid. Thus, 
any number of amino acid residues selected from the group of integers consisting of from 
1 to 15 can be so altered. Thus, for example, 1, 2, 3, 4, 5, 7, or 10 alterations can be 
made. Conservatively modified variants typically provide similar biological activity as 
the unmodified polypeptide sequence from which they are derived. For example, 
substrate specificity, enzyme activity, or ligand/receptor binding is generally at least 
30%, 40%, 50%, 60%, 70%, 80%, or 90% of the native protein for its native substrate. 
Conservative substitution tables providing functionally similar amino acids are well 
known in the art. 

[0061] The following six groups each contain amino acids that are conservative 

substitutions for one another: 

1) Alanine (A), Serine (S), Threonine (T); 

2) Aspartic acid (D), Glutamic acid (E); 

3) Asparagine (N), Glutamine (Q); 

4) Arginine (R), Lysine (K); 

5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); and 

6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W) 
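
By way of non-limiting illustration, the six groups above can be encoded as a simple lookup in Python. The function name is hypothetical. A substitution is treated as conservative only when both residues fall within the same group.

```python
# The six conservative substitution groups listed above, by one-letter code.
CONSERVATIVE_GROUPS = ["AST", "DE", "NQ", "RK", "ILMV", "FYW"]


def is_conservative(original, replacement):
    """True if the two distinct residues belong to the same group."""
    a, b = original.upper(), replacement.upper()
    if a == b:
        return False  # not a substitution at all
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


print(is_conservative("A", "S"))  # same group (1)
print(is_conservative("L", "V"))  # same group (5)
print(is_conservative("D", "K"))  # different groups
```

Residues that appear in none of the six groups, such as glycine or proline, are never reported as conservative substitutions by this sketch.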

[0062] See also, Creighton (1984) Proteins W. H. Freeman and Company. The 

term "detectable marker" encompasses both the selectable markers and assay markers. 
The term "selectable markers" refers to a variety of gene products to which cells 
transformed with an expression construct can be selected or screened, including drug- 
resistance markers, antigenic markers useful in fluorescence-activated cell sorting, 
adherence markers such as receptors for adherence ligands allowing selective adherence, 
and the like. 

[0063] By "encoding" or "encoded", with respect to a specified nucleic acid, is 

meant comprising the information for translation into the specified protein. A nucleic 
acid encoding a protein may comprise non-translated sequences (e.g., introns) within 
translated regions of the nucleic acid, or may lack such intervening non-translated 
sequences (e.g., as in cDNA). The information by which a protein is encoded is specified 
by the use of codons. Typically, the amino acid sequence is encoded by the nucleic acid 
using the "universal" genetic code. However, variants of the universal code, such as are 
present in some plant, animal, and fungal mitochondria, the bacterium Mycoplasma 
capricolum, or the ciliate Macronucleus, may be used when the nucleic acid is expressed 
therein. 

[0064] When the nucleic acid is prepared or altered synthetically, advantage can 

be taken of known codon preferences of the intended host where the nucleic acid is to be 
expressed. 

[0065] An "expression vector" is a DNA molecule comprising a gene that is 

expressed in a host cell. Typically, gene expression is placed under the control of certain 
regulatory elements including promoters, tissue specific regulatory elements, and 
enhancers. Such a gene is said to be "operably linked to" the regulatory elements. 

[0066] The term "expression system" is used herein to refer to a genetic sequence 

which includes a protein encoding region which is operably linked to all of the genetic 
signals necessary to achieve expression of the protein encoding region. Traditionally, the 
expression system will include a regulatory element such as a promoter or enhancer, to 
increase transcription and/or translation of the protein encoding region, or to provide 
control over expression. The regulatory element may be located upstream or downstream 
of the protein encoding region, or may be located at an intron (non coding portion) 
interrupting the protein encoding region. Alternatively it is also possible for the sequence 
of the protein encoding region itself to comprise regulatory ability. 

[0067] The term "functional splice acceptor" refers to any individual functional 

splice acceptor or functional splice acceptor consensus sequence that permits the 
construct of the invention to be processed such that it is included in any mature, 
biologically active mRNA, provided that it is integrated in an active chromosomal locus 
and transcribed as a contiguous part of the pre-messenger RNA of the chromosomal 
locus. 

[0068] As used herein, "heterologous" in reference to a nucleic acid is a nucleic 

acid that originates from a foreign species, or, if from the same species, is substantially 
modified from its native form in composition and/or genomic locus by deliberate human 
intervention. For example, a promoter operably linked to a heterologous structural gene is 
from a species different from that from which the structural gene was derived, or, if from 
the same species, one or both are substantially modified from their original form. A 
heterologous protein may originate from a foreign species or, if from the same species, is 
substantially modified from its original form by deliberate human intervention. 

[0069] The term "host cell" encompasses any cell which contains a vector and 

preferably supports the replication and/or expression of the vector. Host cells may be 
prokaryotic cells such as E. coli, or eukaryotic cells such as yeast, insect, amphibian, or 
mammalian cells. The term as used herein means any cell which may be in culture or in 
vivo as part of a unicellular organism, part of a multicellular organism, or a fused or 
engineered cell culture. 

[0070] The term "internal ribosome entry site" (IRES) is an element which 

permits attachment of a downstream coding region or open reading frame with a 
cytoplasmic polysomal ribosome for purposes of initiating translation thereof in the 
absence of any internal promoters. An IRES is included to initiate translation of 
selectable marker protein coding sequences. Examples of suitable IRESes that can be 
used include the mammalian IRES of the immunoglobulin heavy-chain-binding protein 
(BiP). Other suitable IRESes are those from the picornaviruses. For example, such 
IRESes include those from encephalomyocarditis virus (preferably nucleotide numbers 
163-746), poliovirus (preferably nucleotide numbers 28-640) and foot and mouth disease 
virus (preferably nucleotide numbers 369-804). These IRESes are located in the long 
5' untranslated regions of the picornaviruses and can be removed from their viral 
setting and linked to unrelated genes to produce polycistronic mRNAs. 

[0071] The term "introduced" in the context of inserting a nucleic acid into a cell, 

means "transfection" or "transformation" or "transduction" and includes reference to the 
incorporation of a nucleic acid into a eukaryotic or prokaryotic cell where the nucleic 
acid may be incorporated into the genome of the cell (e.g., chromosome, plasmid, plastid 
or mitochondrial DNA), converted into an autonomous replicon, or transiently expressed 
(e.g., transfected mRNA). 

[0072] The term "isolated" refers to material, such as a nucleic acid or a protein, 

which is: (1) substantially or essentially free from components that normally accompany 
or interact with it as found in its naturally occurring environment. The isolated material 
optionally comprises material not found with the material in its natural environment; or 
(2) if the material is in its natural environment, the material has been synthetically (non- 
naturally) altered by deliberate human intervention to a composition and/or placed at a 
location in the cell (e.g., genome or subcellular organelle) not native to a material found 
in that environment. The alteration to yield the synthetic material can be performed on 
the material within or removed from its natural state. For example, a naturally occurring 
nucleic acid becomes an isolated nucleic acid if it is altered, or if it is transcribed from 
DNA which has been altered, by means of human intervention performed within the cell 
from which it originates. See, e.g., Compounds and Methods for Site Directed 
Mutagenesis in Eukaryotic Cells, Kmiec, U.S. Pat. No. 5,565,350; In Vivo Homologous 
Sequence Targeting in Eukaryotic Cells; Zarling et al., PCT/US93/03868. Likewise, a 
naturally occurring nucleic acid (e.g., a promoter) becomes isolated if it is introduced by 
non-naturally occurring means to a locus of the genome not native to that nucleic acid. 
Nucleic acids which are "isolated" as defined herein are also referred to as "heterologous" 
nucleic acids. 

[0073] As used herein, "nucleic acid" includes reference to a deoxyribonucleotide 

or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise 
limited, encompasses known analogues having the essential nature of natural nucleotides 
in that they hybridize to single-stranded nucleic acids in a manner similar to naturally 
occurring nucleotides (e.g., peptide nucleic acids). 

[0074] As used herein "operably linked" includes reference to a functional 

linkage between a promoter and a second sequence, wherein the promoter sequence 
initiates and mediates transcription of the DNA sequence corresponding to the second 
sequence. Generally, operably linked means that the nucleic acid sequences being linked 
are contiguous and, where necessary to join two protein coding regions, contiguous and 
in the same reading frame. 

[0075] The term "polymerase chain reaction" or "PCR" refers to a procedure 

described in U.S. Pat. No. 4,683,195, the disclosure of which is incorporated herein by 
reference. 

[0076] As used herein, "polynucleotide" includes reference to a 

deoxyribopolynucleotide, ribopolynucleotide, or analogs thereof that have the essential 
nature of a natural ribonucleotide in that they hybridize, under stringent hybridization 
conditions, to substantially the same nucleotide sequence as naturally occurring 
nucleotides and/or allow translation into the same amino acid(s) as the naturally 
occurring nucleotide(s). A polynucleotide can be full-length or a subsequence of a native 
or heterologous structural or regulatory gene. Unless otherwise indicated, the term 
includes reference to the specified sequence as well as the complementary sequence 
thereof. Thus, DNAs or RNAs with backbones modified for stability or for other reasons 
are "polynucleotides" as that term is intended herein. Moreover, DNAs or RNAs 
comprising unusual bases, such as inosine, or modified bases, such as tritylated bases, to 
name just two examples, are polynucleotides as the term is used herein. It will be 
appreciated that a great variety of modifications have been made to DNA and RNA that 
serve many useful purposes known to those of skill in the art. The term polynucleotide as 
it is employed herein embraces such chemically, enzymatically or metabolically modified 
forms of polynucleotides, as well as the chemical forms of DNA and RNA characteristic 
of viruses and cells, including among other things, simple and complex cells. 

[0077] The terms "polypeptide", "peptide" and "protein" are used interchangeably 

herein to refer to a polymer of amino acid residues. The terms apply to amino acid 
polymers in which one or more amino acid residue is an artificial chemical analogue of a 
corresponding naturally occurring amino acid, as well as to naturally occurring amino 
acid polymers. The essential nature of such analogues of naturally occurring amino acids 
is that, when incorporated into a protein that protein is specifically reactive to antibodies 
elicited to the same protein but consisting entirely of naturally occurring amino acids. 
The terms "polypeptide", "peptide" and "protein" are also inclusive of modifications 
including, but not limited to, glycosylation, lipid attachment, sulfation, gamma-carboxylation of 
glutamic acid residues, hydroxylation and ADP-ribosylation. It will be appreciated, as is 
well known and as noted above, that polypeptides are not always entirely linear. For instance, 
polypeptides may be branched as a result of ubiquitination, and they may be circular, 
with or without branching, generally as a result of post-translational events, including 
natural processing events and events brought about by human manipulation which do not 
occur naturally. Circular, branched and branched circular polypeptides may be 
synthesized by non-translation natural processes and by entirely synthetic methods, as well. 

[0078] The term "primer" refers to a nucleic acid which, when hybridized to a 

strand of DNA, is capable of initiating the synthesis of an extension product in the 
presence of a suitable polymerization agent. The primer preferably is sufficiently long to 
hybridize uniquely to a specific region of the DNA strand. 

[0079] As used herein "promoter" includes reference to a region of DNA 

upstream from the start of transcription and involved in recognition and binding of RNA 
polymerase and other proteins to initiate transcription. 

[0080] The term "promoterless" refers to a protein coding sequence contained in a 

vector, retrovirus, adenovirus, adeno-associated virus or retroviral provirus that is not 
directly or significantly under the control of a promoter within the vector, whether it be in 
RNA or DNA form. The vector, plasmid, viral or otherwise, may contain a promoter, but 
that promoter cannot be positioned or configured such that it directly or significantly 
regulates the expression of the promoterless protein coding sequence. 

[0081] The term "protein coding sequence" means a nucleotide sequence 

encoding a polypeptide gene which can be used to distinguish cells expressing the 
polypeptide gene from those not expressing the polypeptide gene. Protein coding 
sequences include those commonly referred to as selectable markers. Examples of protein 
coding sequences include those coding a cell surface antigen and those encoding 
enzymes. A representative list of protein coding sequences includes thymidine kinase, 
beta-galactosidase, tryptophan synthetase, neomycin phosphotransferase, histidinol 
dehydrogenase, luciferase, chloramphenicol acetyltransferase, dihydrofolate reductase 
(DHFR), hypoxanthine guanine phosphoribosyl transferase (HGPRT), CD4, CD8 and 
hygromycin phosphotransferase (HYGRO). 

[0082] As used herein "recombinant" includes reference to a cell or vector, that 

has been modified by the introduction of a heterologous nucleic acid or that the cell is 
derived from a cell so modified. Thus, for example, recombinant cells express genes that 
are not found in identical form within the native (non-recombinant) form of the cell or 
express native genes that are otherwise abnormally expressed, under-expressed or not 
expressed at all as a result of deliberate human intervention. The term "recombinant" as 
used herein does not encompass the alteration of the cell or vector by naturally occurring 
events (e.g., spontaneous mutation, natural transformation/transduction/transposition) 
such as those occurring without deliberate human intervention. 

[0083] As used herein, a "recombinant expression cassette" is a nucleic acid 

construct, generated recombinantly or synthetically, with a series of specified nucleic 
acid elements which permit transcription of a particular nucleic acid in a host cell. The 
recombinant expression cassette can be incorporated into a plasmid, chromosome, 
mitochondrial DNA, virus, or nucleic acid fragment. Typically, the recombinant 
expression cassette portion of an expression vector includes, among other sequences, a 
nucleic acid to be transcribed, and a promoter. 

[0084] A "recombinant host" may be any prokaryotic or eukaryotic cell that 

contains either a cloning vector or an expression vector. This term also includes those 
prokaryotic or eukaryotic cells that have been genetically engineered to contain the cloned 
genes in the chromosome or genome of the host cell. 

[0085] The term "recombinant virus vector" refers to any recombinant 

ribonucleic acid molecule having a nucleotide sequence homologous or complementary 
with a nucleotide sequence in an RNA virus that replicates through a DNA intermediate, 
has a virion RNA and utilizes reverse transcriptase for propagation of virus in a host cell. 
Such viruses can include those that require the presence of other viruses, such as helper 
viruses, to be passaged. Thus, retroviral vectors or retroviruses are intended to include 
those containing substantial deletions or mutations in their RNA. 

[0086] The term "selectively hybridizes" includes reference to hybridization, 

under stringent hybridization conditions, of a nucleic acid sequence to a specified nucleic 
acid target sequence to a detectably greater degree (e.g., at least 2-fold over background) 
than its hybridization to non-target nucleic acid sequences and to the substantial 
exclusion of non-target nucleic acids. Selectively hybridizing sequences typically have 
about at least 80% sequence identity, preferably 90% sequence identity, and most 
preferably 100% sequence identity (i.e., complementary) with each other. 
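
By way of non-limiting illustration, the sequence identity thresholds above can be sketched in Python for an ungapped probe of the same length as its target. The function names and example sequences are hypothetical. Actual hybridization also depends on stringency conditions that this sketch does not model.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def percent_complementarity(probe, target):
    """Percent of probe positions that pair with the target strand."""
    partner = revcomp(target)  # the perfectly complementary probe
    if len(probe) != len(partner):
        raise ValueError("this sketch requires equal-length, ungapped sequences")
    paired = sum(p == q for p, q in zip(probe, partner))
    return 100.0 * paired / len(probe)


def meets_threshold(probe, target, minimum=80.0):
    """Apply the 'at least about 80%' criterion (threshold is adjustable)."""
    return percent_complementarity(probe, target) >= minimum


target = "ACGTACGTAC"  # hypothetical target strand
print(percent_complementarity("GTACGTACGT", target))  # fully complementary
print(percent_complementarity("GTACGTACAA", target))  # two mismatches
```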

[0087] The terms "tag" and "tagged" refer to incorporation of a detectable marker, 

e.g., by incorporation of a radiolabeled amino acid or attachment to a polypeptide of 
biotinyl moieties that can be detected by marked avidin (e.g., streptavidin containing a 
fluorescent marker or enzymatic activity that can be detected by optical or colorimetric 
methods). Various methods of labeling polypeptides and glycoproteins are known in the 
art and may be used. Examples of labels for polypeptides include, but are not limited to, 
the following: radioisotopes (e.g., 3H, 14C, 35S, 125I, 131I), fluorescent labels (e.g., FITC, 
rhodamine, lanthanide phosphors), enzymatic labels (or reporter genes) (e.g., horseradish 
peroxidase, beta-galactosidase, luciferase, alkaline phosphatase), chemiluminescent labels, 
biotinyl groups, predetermined polypeptide epitopes recognized by a secondary reporter 
(e.g., leucine zipper pair sequences, binding sites for secondary antibodies, metal binding 
domains, epitope tags). In some embodiments, labels are attached by spacer arms of 
various lengths to reduce potential steric hindrance. 

[0088] The term "translational stop sequence" refers to a sequence that codes for 

the translational stop codons in three different reading frames. This translational stop 
sequence is physically located downstream (3') of the splice acceptor sequence, but 
upstream (5') of the selectable marker fusion protein translation initiation site. It causes 
truncation of the peptide chain encoded by exons upstream of the retroviral vector at the 
chromosomal locus. It also prevents the translational reading frame of the genomic locus 
from proceeding into the selectable marker gene of the invention, thus preventing 
potential translation of it in a non-sense reading frame. 
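
By way of non-limiting illustration, the requirement that a translational stop sequence present a stop codon in each of the three reading frames can be checked with a short Python sketch. The function names and the example sequences are hypothetical and are not the stop sequence of any vector described herein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frames_with_stop(seq):
    """Return the set of forward reading frames (0, 1, 2) containing a stop codon."""
    seq = seq.upper()
    return {
        frame
        for frame in range(3)
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    }


def stops_all_frames(seq):
    """True if translation entering in any of the three frames is terminated."""
    return frames_with_stop(seq) == {0, 1, 2}


print(stops_all_frames("TAACTAACTAA"))  # hypothetical three-frame stop
print(frames_with_stop("TAAGGGCCC"))    # stop in one frame only
```

Because an upstream exon may splice into the vector in any of the three frames, a sequence that terminates only one frame would allow read-through in the other two.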

[0089] As used herein, "vector" includes reference to a nucleic acid used in 

transfection of a host cell and into which can be inserted a polynucleotide. Vectors are 
often replicons. Expression vectors permit transcription of a nucleic acid inserted therein. 

[0090] The following terms are used to describe the sequence relationships 

between two or more nucleic acids or polynucleotides: (a) "reference sequence", (b) 
"comparison window", (c) "sequence identity", (d) "percentage of sequence identity", and 
(e) "substantial identity". 

[0091] (a) As used herein, "reference sequence" is a defined sequence used as a 

basis for sequence comparison. A reference sequence may be a subset or the entirety of a 
specified sequence; for example, as a segment of a full-length cDNA or gene sequence, 
or the complete cDNA or gene sequence. 

[0092] (b) As used herein, "comparison window" includes reference to a 

contiguous and specified segment of a polynucleotide sequence, wherein the 
polynucleotide sequence may be compared to a reference sequence and wherein the 
portion of the polynucleotide sequence in the comparison window may comprise 
additions or deletions (i.e., gaps) compared to the reference sequence (which does not 
comprise additions or deletions) for optimal alignment of the two sequences. Generally, 
the comparison window is at least 20 contiguous nucleotides in length, and optionally can 
be 30, 40, 50, 100, or longer. Those of skill in the art understand that to avoid a high 
similarity to a reference sequence due to inclusion of gaps in the polynucleotide 
sequence, a gap penalty is typically introduced and is subtracted from the number of 
matches. 

[0093] Methods of alignment of sequences for comparison are well-known in the 

art. Optimal alignment of sequences for comparison may be conducted by the local 
homology algorithm of Smith and Waterman, Adv. Appl. Math. 2:482 (1981); by the 
homology alignment algorithm of Needleman and Wunsch, J. Mol. Biol. 48:443 (1970); 
by the search for similarity method of Pearson and Lipman, Proc. Natl. Acad. Sci. 
85:2444 (1988); by computerized implementations of these algorithms, including, but not 
limited to: CLUSTAL in the PC/Gene program by Intelligenetics, Mountain View, Calif.; 
GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software 
Package, Genetics Computer Group (GCG), 575 Science Dr., Madison, Wis., USA; the 
CLUSTAL program is well described by Higgins and Sharp, Gene 73:237-244 (1988); 
Higgins and Sharp, CABIOS 5:151-153 (1989); Corpet, et al., Nucleic Acids Research 
16:10881-90 (1988); Huang, et al., Computer Applications in the Biosciences 8:155-65 
(1992), and Pearson, et al., Methods in Molecular Biology 24:307-331 (1994). The 
BLAST family of programs which can be used for database similarity searches includes: 
BLASTN for nucleotide query sequences against nucleotide database sequences; 
BLASTX for nucleotide query sequences against protein database sequences; BLASTP 
for protein query sequences against protein database sequences; TBLASTN for protein 
query sequences against nucleotide database sequences; and TBLASTX for nucleotide 
query sequences against nucleotide database sequences. See, Current Protocols in 
Molecular Biology, Chapter 19, Ausubel, et al., Eds., Greene Publishing and Wiley-
Interscience, New York (1995). 
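
To make the idea of a global homology alignment concrete, the following Python sketch computes the optimal global alignment score of two sequences by dynamic programming in the manner of Needleman and Wunsch. The scoring values (match = 1, mismatch = -1, linear gap = -1) are illustrative assumptions and are not the defaults of any of the programs named above.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory use is O(len(b)).
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: prefix of a aligned against nothing.
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
```

The local algorithm of Smith and Waterman differs mainly in that cell values are not allowed to fall below zero and the best score is taken anywhere in the matrix.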

[0094] Unless otherwise stated, sequence identity/similarity values provided 

herein refer to the value obtained using the BLAST 2.0 suite of programs using default 
parameters. Altschul et al., Nucleic Acids Res. 25:3389-3402 (1997). Software for 
performing BLAST analyses is publicly available, e.g., through the National Center for 
Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This algorithm involves first 
identifying high scoring sequence pairs (HSPs) by identifying short words of length W in 
the query sequence, which either match or satisfy some positive-valued threshold score T 
when aligned with a word of the same length in a database sequence. T is referred to as 
the neighborhood word score threshold (Altschul et al., supra). These initial 
neighborhood word hits act as seeds for initiating searches to find longer HSPs 
containing them. The word hits are then extended in both directions along each sequence 
for as far as the cumulative alignment score can be increased. Cumulative scores are 
calculated using, for nucleotide sequences, the parameters M (reward score for a pair of 
matching residues; always > 0) and N (penalty score for mismatching residues; always < 0). 
For amino acid sequences, a scoring matrix is used to calculate the cumulative score. 
Extension of the word hits in each direction is halted when: the cumulative alignment 
score falls off by the quantity X from its maximum achieved value; the cumulative score 
goes to zero or below, due to the accumulation of one or more negative-scoring residue 
alignments; or the end of either sequence is reached. The BLAST algorithm parameters 
W, T, and X determine the sensitivity and speed of the alignment. The BLASTN program 
(for nucleotide sequences) uses as defaults a word length (W) of 11, an expectation (E) of 
10, a cutoff of 100, M=5, N=-4, and a comparison of both strands. For amino acid 
sequences, the BLASTP program uses as defaults a word length (W) of 3, an expectation 
(E) of 10, and the BLOSUM62 scoring matrix (see Henikoff & Henikoff (1992) Proc. 
Natl. Acad. Sci. USA 89:10915). 
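
The seed-and-extend procedure described in this paragraph can be sketched as follows. This is a deliberately simplified Python illustration and not the BLAST implementation: seeds are exact word matches only (the neighborhood threshold T is ignored), extension is ungapped and shown in one direction, and the function names and the X value are assumptions. M and N use the BLASTN values quoted above.

```python
def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_right(query, subject, q_start, s_start, match=5, mismatch=-4, x_drop=20):
    """Ungapped extension to the right from (q_start, s_start).

    Stops when the running score falls x_drop below its maximum, when it
    reaches zero or below, or when either sequence ends.
    Returns (best_score, length_at_best_score).
    """
    score = best = best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop or score <= 0:
            break
    return best, best_len


print(find_seeds("GATTACA", "TTGATTACATT", w=7))
print(extend_right("AAAAACCCCC", "AAAAAGGGGG", 0, 0, x_drop=10))
```

Extension to the left would follow the same logic with decreasing indices, and the two partial scores would be combined into the score of the high scoring sequence pair.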

[0095] In addition to calculating percent sequence identity, the BLAST algorithm 

also performs a statistical analysis of the similarity between two sequences (see, e.g., 
Karlin & Altschul, Proc. Natl. Acad. Sci. USA 90:5873-5877 (1993)). One measure of 
similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), 
which provides an indication of the probability by which a match between two nucleotide 
or amino acid sequences would occur by chance. BLAST searches assume that proteins 
can be modeled as random sequences. However, many real proteins comprise regions of 
nonrandom sequences which may be homopolymeric tracts, short-period repeats, or 
regions enriched in one or more amino acids. Such low-complexity regions may be 
aligned between unrelated proteins even though other regions of the protein are entirely 
dissimilar. A number of low-complexity filter programs can be employed to reduce such 
low-complexity alignments. For example, the SEG (Wootton and Federhen, Comput. 
Chem., 17:149-163 (1993)) and XNU (Claverie and States, Comput. Chem., 17:191-201 
(1993)) low-complexity filters can be employed alone or in combination. 

[0096] (c) As used herein, "sequence identity" or "identity" in the context of two 

nucleic acid or polypeptide sequences includes reference to the residues in the two 
sequences which are the same when aligned for maximum correspondence over a 
specified comparison window. When percentage of sequence identity is used in reference 
to proteins it is recognized that residue positions which are not identical often differ by 
conservative amino acid substitutions, where amino acid residues are substituted for other 
amino acid residues with similar chemical properties (e.g. charge or hydrophobicity) and 
therefore do not change the functional properties of the molecule. Where sequences differ 
in conservative substitutions, the percent sequence identity may be adjusted upwards to 
correct for the conservative nature of the substitution. Sequences which differ by such 
conservative substitutions are said to have "sequence similarity" or "similarity". Means 
for making this adjustment are well-known to those of skill in the art. Typically this 
involves scoring a conservative substitution as a partial rather than a full mismatch, 
thereby increasing the percentage sequence identity. Thus, for example, where an 
identical amino acid is given a score of 1 and a non-conservative substitution is given a 
score of zero, a conservative substitution is given a score between zero and 1. The 
scoring of conservative substitutions is calculated, e.g., according to the algorithm of 
Myers and Miller, Computer Applic. Biol. Sci., 4:11-17 (1988), e.g., as implemented in 
the program PC/GENE (Intelligenetics, Mountain View, Calif., USA). 
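
The partial-credit scoring described above (1 for an identical residue, 0 for a non-conservative substitution, an intermediate value for a conservative one) can be sketched in Python as follows. The residue groupings and the intermediate score of 0.5 are illustrative assumptions only; this is not the cited algorithm or the PC/GENE implementation.

```python
# Assumed groups of chemically similar amino acids (illustrative only).
CONSERVATIVE_GROUPS = ("ILVM", "FWY", "KRH", "DE", "ST", "NQ")


def residue_score(a, b, partial=0.5):
    """Score one aligned residue pair: 1.0 identical, partial if conservative, else 0.0."""
    if a == "-" or b == "-":
        return 0.0  # gap positions earn no credit
    if a == b:
        return 1.0
    if any(a in group and b in group for group in CONSERVATIVE_GROUPS):
        return partial
    return 0.0


def percent_similarity(aligned_a, aligned_b, partial=0.5):
    """Percentage score over two pre-aligned, equal-length protein sequences."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    total = sum(residue_score(a, b, partial) for a, b in zip(aligned_a, aligned_b))
    return 100.0 * total / len(aligned_a)


# L/I and D/E count as conservative, K/K is identical, S/A is not conservative.
print(percent_similarity("LKDS", "IKEA"))
```

With strict identity the same pair of sequences would score 25%, which shows how the adjustment raises the reported value.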

[0097] (d) As used herein, "percentage of sequence identity" means the value 

determined by comparing two optimally aligned sequences over a comparison window, 
wherein the portion of the polynucleotide sequence in the comparison window may 
comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which 
does not comprise additions or deletions) for optimal alignment of the two sequences. 
The percentage is calculated by determining the number of positions at which the 
identical nucleic acid base or amino acid residue occurs in both sequences to yield the 
number of matched positions, dividing the number of matched positions by the total 
number of positions in the window of comparison and multiplying the result by 100 to 
yield the percentage of sequence identity. 
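
The calculation just described can be written as a short Python sketch. It assumes the two sequences have already been optimally aligned and padded with "-" for gaps, so that both strings span the same comparison window; the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of window positions at which both sequences carry the same residue.

    Both arguments must be pre-aligned strings of equal length in which
    gaps are written as '-'. Gap positions count toward the window length
    but never as matches.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# 8 matched positions in a window of 10 (one gap, one mismatch).
print(percent_identity("ACGT-ACGTA", "ACGTTACCTA"))
```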

[0098] (e) (i) The term "substantial identity" of polynucleotide sequences means 

that a polynucleotide comprises a sequence that has at least 70% sequence identity, 
preferably at least 80%, more preferably at least 90% and most preferably at least 95%, 
compared to a reference sequence using one of the alignment programs described using 
standard parameters. One of skill will recognize that these values can be appropriately 
adjusted to determine corresponding identity of proteins encoded by two nucleotide 
sequences by taking into account codon degeneracy, amino acid similarity, reading frame 
positioning and the like. Substantial identity of amino acid sequences for these purposes 
normally means sequence identity of at least 60%, or preferably at least 70%, 80%, 90%, 
and most preferably at least 95%. 

[0100] Another indication that nucleotide sequences are substantially identical is 

if two molecules hybridize to each other under stringent conditions. However, nucleic 
acids which do not hybridize to each other under stringent conditions are still 
substantially identical if the polypeptides which they encode are substantially identical. 
This may occur, e.g., when a copy of a nucleic acid is created using the maximum codon 
degeneracy permitted by the genetic code. One indication that two nucleic acid sequences 
are substantially identical is that the polypeptide which the first nucleic acid encodes is 
immunologically cross reactive with the polypeptide encoded by the second nucleic acid. 

[0101] (e) (ii) The term "substantial identity" in the context of a peptide 

indicates that a peptide comprises a sequence with at least 70% sequence identity to a 
reference sequence, preferably 80%, more preferably 85%, most preferably at least 90% or 
95% sequence identity to the reference sequence over a specified comparison window. 
Optionally, optimal alignment is conducted using the homology alignment algorithm of 
Needleman and Wunsch, J. Mol. Biol. 48:443 (1970). An indication that two peptide 
sequences are substantially identical is that one peptide is immunologically reactive with 
antibodies raised against the second peptide. Thus, a peptide is substantially identical to a 
second peptide, for example, where the two peptides differ only by a conservative 
substitution. Peptides which are "substantially similar" share sequences as noted above 
except that residue positions which are not identical may differ by conservative amino 
acid changes. 

[0102] The terms "oligoclonal" and "polyclonal", as applied to cell populations, indicate 

a population of cells where some cells within that population are not genetically identical 
to the rest of the cells of that population. Conversely, the term "monoclonal" or 
"monoclonal cell population" indicates that all cells within that population are genetically 
identical. Differences in the "genetic identity" of a population of cells in the context of 
this invention arise by random retroviral integration into different genomic insertion 
sites. 

[0103] The invention relates to a method for identifying a particular protein 

profile for a cell that can be used for diagnosis or as a target site for a drug. According to 
the invention, activity at specific genetic loci is correlated with the functional state and/or 
concentration of product proteins in a test cell and is then compared to a reference cell for 
elucidation of differential and quantitative protein expression profiles. 

[0104] The method can be used to identify genes or gene products associated with 

a biological process or state of interest, to identify proteins including entire pathways of 
expression associated with a particular state, to screen cells for the identification of a 
particular protein profile that is associated with said state for diagnosis, to identify novel 
proteins associated with particular cell types or states, or even to identify polymorphisms 
in genes that result in differential protein products. This can include assaying for the 
presence or absence of expression of a particular protein, or for a relative quantitative expression profile. 

[0105] The method employs three basic steps to achieve its objective. First the 

test cell is transformed with a promoterless polynucleotide construct to "tag" the cell by 
integration of a detectable marker or reporter nucleotide sequence in the genome of the 
cell. The detectable marker sequence encodes a protein that is only produced when the 
integration event has occurred in a cellular gene in such a fashion that the marker protein 
is produced under the transcriptional control of a cellular gene promoter, resulting in an 
interrupted gene product and preferably a fusion protein incorporating the tag. 

[0106] This is achieved by inclusion of the marker nucleotide sequence in a 

polynucleotide construct, typically a vector, with no promoter operably linked to the 
marker nucleotide sequence. Thus expression of the marker is dependent upon initiation 
of transcription from the target cell genome. Any vector which is capable of integrating 
into the genome of said target cell can be used according to the invention; this can 
include, but is not limited to, parvoviruses, foamy viruses, retrotransposons, etc., 
and/or naked DNA. In a preferred embodiment the vector is a defective retrovirus, and the 
method involves packaging of the defective retroviral genome and insertion of the defective 
retroviral genome via abortive infection into the DNA of the cells to be analyzed. 

[0107] Production of the marker indicates that the construct has been integrated 

into an actively transcribed region of the cellular genome and production/accumulation of 
the marker protein becomes dependent upon transcription initiated at cellular promoters. 
Performance of the foregoing aspect of the instant invention results in a mixed population 
of cells wherein the marker is integrated into a different gene in each cell. In some 
embodiments of the instant invention, this initial mixture of the cells may be separated 
into monoclones containing genetically identical cells in which the marker is integrated 
into the same gene. These monoclonal populations can be obtained by a number of 
different methods. For example, the original mixture of cells with the integrated marker 
can be seeded at a very low concentration, which allows each individual cell of the 
mixture to expand into a monoclone. 

[0108] The advantage of isolating individual clones is that these clones can be 

used, for example, as an assay system for screening of small molecule drugs that block 
the expression or function of the fusion protein tagged with the marker exon. This 
screening is based on measuring the variations in fluorescence signal produced by 
different drugs at different concentrations. 

[0109] In the second step of the invention, cells containing integrated sequences 

can be sorted and fractionated on the basis of the expression of the marker protein. Again 
any of a number of different sorting methodologies can be used depending on the 
chemical, physical, or mechanical characteristics of the marker gene. In a preferred 
embodiment the marker is a fluorescent protein and quantitation of protein is performed 
by Fluorescence Activated Cell Sorting (FACS) such that in addition to quantifying 
protein, cells expressing given levels of protein may be sorted and collected in fractions 
(there may be any number of fractions, e.g. 5, 10, 20, 25, 50, 100, etc., depending upon 
the desired level of resolution). Also, as mentioned above, the tagged population of cells 
can be sorted into individual clones, which would allow measuring the level of 
expression of each tagged protein in all possible stages of the cell cycle, and would also 
provide a direct link between the expression level of the fusion protein and the identity of 
the trapped gene. 
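
As a minimal sketch of how cells might be divided into a chosen number of expression fractions, the following Python function assigns each cell to a fraction according to the rank of its fluorescence reading. It is an assumption-laden illustration: a physical cell sorter applies intensity gates rather than ranks, and the cell identifiers, readings and function name are hypothetical.

```python
def assign_fractions(intensities, n_fractions):
    """Split cells into n_fractions groups of roughly equal size by fluorescence.

    intensities: dict mapping a cell identifier to its fluorescence reading.
    Returns a dict mapping each cell identifier to a fraction index,
    where 0 is the dimmest fraction and n_fractions - 1 the brightest.
    """
    if n_fractions < 1:
        raise ValueError("n_fractions must be at least 1")
    ranked = sorted(intensities, key=intensities.get)  # dimmest first
    total = len(ranked)
    return {cell: rank * n_fractions // total for rank, cell in enumerate(ranked)}


# Hypothetical readings for six tagged cells, sorted into three fractions.
readings = {"c1": 5.0, "c2": 50.0, "c3": 500.0, "c4": 7.0, "c5": 70.0, "c6": 700.0}
print(assign_fractions(readings, 3))
```

Increasing n_fractions corresponds to the finer resolution mentioned above, at the cost of fewer cells per fraction.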

[0110] In addition other means such as the use of ferrous metal conjugates and 

electromagnetic force can be used for expression-dependent fractionation of cells. The 
FACS sorting can be employed to isolate single cells; each of these cells may then be 
used to establish an independent clone of homogeneous cells in which the marker is 
integrated into one particular gene. 

[0111] The high speed and resolution of the invention allow, for the first time, 

real-time analysis of molecular pathways of activation including but not limited to signal 
transduction via phosphorylation levels of targets (i.e., direct measurements of 
phosphatases/kinases and/or production of cyclic AMP). The invention also provides 
analysis of gated and non-gated channels to monitor signaling via Ca²⁺, Mg²⁺, Zn²⁺, pH 
and other trace elements. This analysis can also monitor protein-protein interactions 
using fluorescent Ab, fluorescent Ag, fluorescent ligand, fluorescent receptor, 
fluorescent substrates or non-fluorescent substrates that become fluorescent after 
enzymatic cleavage/activation or even conventional colorimetric enzyme-substrate based 
reactions. The combination of speed and precision offered by FACS has not been fully 
demonstrated using other methods of analysis; however, other methods may be used. In 
another embodiment electromagnetic forces can also be used to separate and quantitate 
cells by the expression of the marker gene product. 

[0112] In the final step, once sorted into oligoclonal or monoclonal 

subpopulations (based upon the level of marker peptide expression), DNA, RNA, and/or 
protein are isolated from the cells in each subpopulation/monoclone and analyzed. This 
analysis includes determination of the cellular DNA sequences into which the marker 
DNA has been inserted. Then comparison is made against a reference cell to identify 
differential protein expression that is correlated with a particular state, for example a 
disease state such as cancer for use as a diagnostic or to identify a potential target for 
drug intervention. 

[0113] Rather than isolating and amplifying each individual cell which has 

acquired an integrated and expressed marker tag and then subsequently analyzing the site 
of insertion and the level of protein expression, estimates of these values may be obtained 
by analyzing statistically significant numbers of cells with such integrants that have been 
clustered together by virtue of demonstrating an approximately equivalent level of gene 
expression. 

[0114] For example, suppose that in a population of 1,000,000 cells, 10,000 cells with 

integration events in a series of different genes demonstrate a 
mean marker peptide concentration of value x, where x represents the mean marker 
peptide concentration seen in the first percentile of cells detected to have any expression 
of the marker peptide. If integration sites are determined in any (or all) of these cells, the 
genes where they have integrated are said to be expressed in the lowest percentile of 
detectable expression. This can be said independent of any knowledge that relates a 
specific integration event in a specific gene in a specific clone of cells to a specific 
protein level. By applying appropriate statistical methods and examining a large enough 
number of integration events, the need to obtain and analyze specific clones is obviated 
for the purpose of determining relative levels of protein expression from a given genetic 
locus. This initial cluster analysis of a subpopulation may be followed up by analysis of 
individual genes from monoclonal populations of cells isolated as described above. 
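
The percentile reasoning in this paragraph can be sketched in Python as follows. Each record pairs the gene in which the marker integrated with the marker level measured for that cell; cells are ranked, each is given a percentile from 1 (lowest detectable expression) to 100, and the percentiles are summarized per gene. The gene names, values and the choice of the median as the summary statistic are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def gene_percentiles(cells):
    """Summarize, per gene, the expression percentile of the cells tagged in that gene.

    cells: list of (gene, marker_level) tuples for cells with detectable expression.
    Returns a dict mapping gene -> median percentile (1 = lowest, 100 = highest).
    """
    ranked = sorted(cells, key=lambda record: record[1])
    n = len(ranked)
    by_gene = defaultdict(list)
    for rank, (gene, _level) in enumerate(ranked):
        by_gene[gene].append(rank * 100 // n + 1)
    return {gene: median(values) for gene, values in by_gene.items()}


# Hypothetical measurements for four tagged cells.
cells = [("GENE_A", 10.0), ("GENE_B", 200.0), ("GENE_A", 12.0), ("GENE_C", 90.0)]
print(gene_percentiles(cells))
```

No link between a specific clone and a specific protein level is needed here; only the pooled ranking and the integration sites recovered from each cell are used.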

[0115] In further embodiments the differential expression data can be used with 

statistical methods to assign marker peptide expression levels for each interrupted gene. 
In yet further embodiments, a database that combines this newly generated data with 
other data sources can be generated to produce a record of the relationship of gene 
expression (at the RNA and protein level) to the function of the cell. In 
cases where the cells under study can be obtained in both cancerous and normal 
conditions, comparisons of the relative gene expression can be used to identify genes 
which can serve either as diagnostic markers of pathology or as sites of pharmacologic 
intervention for treatment of cancer. Similarly, other diseases can be analyzed merely by 
substituting the source of cells for analysis. 
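
A comparison of test and reference profiles of the kind described above might be sketched as follows. The Python function compares per-gene marker levels from the two cell types and reports genes whose level differs by at least a chosen log2 fold change, along with genes detected in only one profile. The threshold, gene names and values are hypothetical.

```python
import math


def differential_genes(test, reference, min_abs_log2=1.0):
    """Compare per-gene marker levels between a test and a reference profile.

    test, reference: dicts mapping gene -> marker level (positive numbers).
    Returns (changed, only_test, only_reference), where changed maps each
    shared gene whose |log2(test/reference)| >= min_abs_log2 to that ratio.
    """
    changed = {}
    for gene in test.keys() & reference.keys():
        if test[gene] > 0 and reference[gene] > 0:
            log_ratio = math.log2(test[gene] / reference[gene])
            if abs(log_ratio) >= min_abs_log2:
                changed[gene] = log_ratio
    only_test = sorted(test.keys() - reference.keys())
    only_reference = sorted(reference.keys() - test.keys())
    return changed, only_test, only_reference


# Hypothetical profiles, e.g. cancerous (test) versus normal (reference) cells.
test_profile = {"A": 8.0, "B": 1.0, "C": 3.0}
reference_profile = {"A": 2.0, "B": 1.0, "D": 5.0}
print(differential_genes(test_profile, reference_profile))
```

Genes appearing in only one profile are kept separate because a fold change cannot be computed for them; they may reflect expression that is switched on or off in the state of interest.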

[0116] Each of the foregoing steps will now be described in more detail below. It 

is understood that for each step numerous expedients may be employed as well as 
alternative molecular biology techniques currently or yet to become available which 
achieve the same results. Choice of reaction agents, protocols, etc. is considered nothing 
more than routine optimization of experimental parameters based upon the teaching 
herein and is intended to be within the scope of the invention. FIG. 19 is a flow diagram 
depicting an overview of the process including several specific examples of alternatives 
available for each step. 

Identification and Selection of Test and Reference Cell Types 



[0117] According to the invention a comprehensive protein profile is generated 

from any cell type of interest. A test cell can be any cell, or portion thereof, with genetic 
material. A reference cell can be any cell type against which the difference in protein 
expression patterns and levels is to be measured. Preferably the cells are maintained 
as similar to their native state as possible, and culture techniques, incubation times, etc. are 
performed identically between the two to minimize any non-naturally occurring 
differences. For example, development of the comprehensive protein profiles of 
pre-cancerous and malignant test cells and a normal reference cell could be achieved 
according to the invention. Such identifiers of protein signatures will characterize 
molecular events of tumor development and cellular mechanisms involved. 

[0118] Recent initiatives in identification of molecular fingerprints of tumors 

have been focused on studies of DNA and mRNA levels. These studies indicate that gene 
expression patterns in two tumor samples from the same individual were almost always 
more similar to each other than either was to any other sample and that tumors could be 
classified in subtypes distinguished by differences in their gene expression patterns. 

[0119] According to the invention, a test cell and a reference cell could be 

obtained from the same patient to get an individual protein fingerprint that can be used to 
diagnose or treat that patient. For example, when a tumor is excised, often a margin of 
non-transformed cells is removed as well. Protein profiling can help to ensure that the cells 
removed all had similar profiles to normal cells rather than the metastatic cells from the 
same patient. 

[0120] Comparisons may be made according to the invention from different 

cancers (e.g. lung, breast, colon, melanoma), different stages of malignant progression 
from corresponding normal tissue to highly malignant primary site and/or metastatic site, 
tumors caused by endemic/local agents (e.g. environmental agents such as asbestos, or 
infectious agents), tissues surrounding the incipient tumor (e.g. blood cells), extracts from body 
fluids (e.g. cancer cells of the urinary tract may be shed into urine), and tumors from 
species other than human. 



[0121] One example of cell lines that may be used as test cells includes human 

tumor cell lines. For example, human tumor cell lines representing a broad spectrum of 
human tumors and exhibiting acceptable properties and growth characteristics may be 
grown according to standard operating procedure for cell line expansion, 
cryopreservation and characterization. Examples of human cancer cell lines which may 
be used according to the invention include: lung cancer human cell lines (non-small 
cell lung cancer adenocarcinoma cell line, A549; adenosquamous cell carcinoma, 
NCI-H125; squamous cell carcinoma, SK-MES-1; bronchial-alveolar carcinoma, NCI-H322; 
large cell carcinoma, A 427; mucoepidermoid carcinoma, NCI-H292; small cell lung 
cancer (SCLC) "Classic", NCI-H69; SCLC "Variant", NCI-H82; SCLC "Adherent", 
SHP77); colon cancer human cell lines (COLO 205, DLD-1, HCT-15, HT29, LoVo); 
breast cancer human cell lines (MCF7 WT, MCF7 ADR, MDA-MB-231, HS 578T); 
prostate cancer human cell lines (DU 145, LNCaP, PC-3, UMSCP-1); melanoma human 
cell lines (RPMI-7951, LOX, SK-MEL 2, SK-MEL-5, A 375); renal cancer human cell 
lines (A 498, A 704, Caki-1, SN12C, UO-31); ovarian cancer human cell lines 
(IGROV-1, OVCAR-3, SK-OV-3, A2780, OVCAR-4, OVCAR-5, OVCAR-8); leukemia human 
cell lines (Molt-4, RPMI 8336, P388, P388/ADR-Resist, CCRF-CEM, CCRF-SB); central 
nervous system cancer human cell lines (SF 126, SF 295, SNB19, SNB 44, SNB 56, TE 
671, 4251); sarcoma human cell lines (A-204, A 673, MS 913T, Ht 1080, Te 85); head 
and neck squamous cancer human cell lines (UM-SCC-MB,C, UM-SCC-21 A, 
UM-SCC-22B); normal fibroblasts (MRC-5, lung, human; CCD-194Lu, lung, human; IMR-90, lung, 
human; NIH 3T3, mouse, embryo). 

[0122] Another example of cell types which could be used includes primary cells 

derived from normal or cancer tissue specimens such as a tissue specimen obtained from 
normal and/or cancerous tissue that is disaggregated using dissociating enzymes and 
single cell suspension that is enriched, purified and characterized using MACS tumor cell 
reagents. 

[0123] In yet another embodiment test and reference cells can be used to develop 

protein profiles associated with aging such as different stages of ontogenesis, for example 
protein profiles of embryonic liver-derived hematopoietic stem cell (HSC) vs. cord blood 
HSC vs. young adult HSC vs. old age organism-derived HSC. 

[0124] In yet another embodiment protein profiles of cells from patients with 

neurodegenerative diseases such as Alzheimer's disease or Parkinson's 
disease may be assayed. 

[0125] In yet another embodiment profiles may be obtained for other age-related 

conditions such as male pattern baldness. 

[0126] In yet another embodiment protein profiles can be obtained from human 

pathological conditions such as genetic diseases (inborn errors of metabolism: Adenosine 
deaminase deficiency, cystic fibrosis, Duchenne's muscular dystrophy). 

[0127] In yet another embodiment protein profiles may be obtained for 

multifactorial and somatic genetic diseases (hypertension, coronary artery disease, 
obesity, and diabetes mellitus). 

[0128] In yet still another embodiment profiles may be obtained for other non- 

genetic diseases (AIDS and other infectious diseases). 

[0129] In yet still another embodiment profiles may be obtained for autoimmune 

disorders (rheumatoid arthritis, systemic lupus erythematosus, multiple sclerosis, etc.). 

[0130] In yet another embodiment profiles may be obtained for human non- 

disease traits such as physical traits (athletic abilities, visual acuity); cognitive and 
personality traits (musical ability, cognition, memory, male-pattern-baldness). 

[0131] In yet another embodiment two cells of the same type may be assayed to 

identify alternative gene forms, such as polymorphic loci etc. 

[0132] Further, as can be seen, any cell type can be used according to the 

invention including but not limited to, microorganisms, plants, invertebrates, vertebrates, 
and mammals. 

Integration of Assay Marker Peptide-encoding Sequences into the Genetic Material of Test and Reference Cells 

[0133] According to the invention the process begins by the insertion of an assay 

marker DNA sequence into the genome of a test cell to be analyzed. This assay marker 
sequence includes any expressed molecule which can be screened in a defined assay 
system such that the cells may be identified, selected, sorted and/or preferably quantified, 
based upon the expression of the marker sequence. In a preferred embodiment this 
marker sequence (or tag) will be a chromophore which will fluoresce (such as humanized 
Renilla green fluorescent protein). Other examples of assay marker sequences which 
may be used according to the invention include α-1,3-galactosyltransferase, 
sodium/iodide symporter, or a viral envelope protein. Still other marker 
systems include but are not limited to any detectable cell surface displayed protein; other 
markers can be used such as lipid, lipoprotein, glycolipid, and glycoprotein targets that 
can be tagged with specific fluorescent compounds using labeled antibodies, direct 
chemical linkage and/or combination of direct and indirect tagging. 

[0134] The marker peptide/fusion protein may be intracellular and dispersed 

throughout the cytoplasm or localized to specific intracellular compartments by leader 
sequences present on the marker peptide/fusion protein(s). The marker peptide(s) can be 
incorporated into a single protein or into macromolecular complexes in which several 
different proteins (derived from multiple cellular genes) are linked by specific molecular 
interactions that demonstrate a unique fluorescent profile. 

[0135] The marker peptide is only produced when the integration event has 

occurred in a cellular gene in such a fashion that the marker protein is produced under the 
transcriptional control of a cellular gene promoter. This is achieved by inclusion of the 
marker DNA sequence within a promoterless expression construct. 

[0136] In a preferred embodiment the expression construct is included within an 

appropriate gene transfer vehicle which is then used to transduce the recipient host test 
cells so that they express the marker gene. The gene delivery vehicle can be any delivery 
vehicle known in the art and can include simply naked DNA, delivery of which is facilitated by 
receptor-mediated transfection or via homologous recombination (see FIG. 15). In a 
homologous recombination embodiment a vector is engineered to have highly repeated 
sequences such as Alu flanking the assay marker gene so that recombination is facilitated 
at the repetitive sites causing integration of the nucleotide. Any of a number of vectors 
can be used, such vectors include but are not limited to eukaryotic vectors, prokaryotic 
vectors (such as for example bacterial vectors) and viral vectors including but not limited 
to retroviral vectors, adenoviral vectors, adeno-associated viral vectors, lentivirus vectors 
(human and other including porcine), or any other vector which will stably integrate into 
the host cell genome. 

[0137] The preferred embodiment of the invention will use vectors (DNA, RNA, 

DNA/RNA hybrids etc.) that contain markers which may be sorted to include but not 
limited to cell surface displayed or cytoplasmic protein; lipid, lipoprotein, glycolipid, and 
glycoprotein targets that can be tagged with specific fluorescent, chemiluminescent, or
bioluminescent compounds using labeled antibodies, direct chemical linkage and/or
combination of direct and indirect tagging. These vectors (see FIG. 2A-K, 13A) use
either the processes of illegitimate recombination, homologous recombination, and/or 
viral vectors to integrate said markers into the genomic DNA of target cells (the 
integrated vector serves as a molecular bar code). Alu sequences are approximately 300 
bp in length and are found on average every 3000 bp in the human genome. Alu or other 
highly repetitive sequences can be used to induce homologous recombination for 
insertion of the marker gene. The vectors will be delivered to the target cells via standard
gene delivery methods including but not limited to lipid-mediated transfection (cationic,
anionic, and neutral charged), activated dendrimers (PolyFect™ Reagent, SuperFect™
Reagent {Qiagen}), polyethylenimine (PEI), receptor-mediated transfection
(fusogenic peptide/protein), calcium phosphate transfection, electroporation, particle
bombardment, direct injection of naked DNA, diethylaminoethyl (DEAE)-dextran
transfection, etc. Though the preferred embodiment is the use of plasmid-based vectors,
the use of other high-efficiency viral vectors is not precluded.

[0138] The expression vehicles (vectors) of the invention can be engineered by 

any of a number of techniques known to those of skill in the art. The following is a 
summary of techniques for construction and transformation of the vectors of the 
invention. 

Genetic Engineering Techniques for Construction and Delivery of Vectors 

[0139] In a preferred embodiment the expression vehicles or vectors of the 

invention comprising the expression system also comprise a selectable marker gene to 
select for transformants as well as a method for selecting those transformants for 
propagation of the construct in bacteria. Such selectable marker may contain an antibiotic
resistance gene, such as those that confer resistance to ampicillin, kanamycin, 
tetracycline, or streptomycin and the like. These can include genes from prokaryotic or 
eukaryotic cells such as dihydrofolate reductase or multi-drug resistance I gene, 
hygromycin B resistance that provide for positive selection. Any type of positive selector 
marker can be used such as neomycin or Zeocin and these types of selectors are generally 
known in the art. Several procedures for insertion and deletion of genes are known to
those of skill in the art and are disclosed, for example, in Maniatis, "Molecular Cloning",
Cold Spring Harbor Press. See also Post et al., Cell, Vol. 24:555-565 (1981). An entire
transcription unit must be provided for the selectable marker genes (promoter-gene-polyA),
and the genes must be flanked on one end with a promoter regulatory
region and on the other with a transcription termination signal (polyadenylation site). Any
known promoter/transcription termination combination can be used with the selectable
marker genes. Examples of such systems include β-lactamase (penicillinase) and lactose
promoter systems (Chang et al., Nature, 1977, 198:1056); the tryptophan (trp) promoter
system (Goeddel, et al., Nucleic Acid Res., 1980, 8:4057) and the lambda-derived PL
promoter and N-gene ribosome binding site (Shimatake et al., Nature 1981, 292:128).
Other promoters such as cytomegalovirus promoter or Rous Sarcoma Virus can be used 
in combination with various ribosome elements such as SV40 poly A. The promoter can 
be any promoter known in the art including constitutive, (supra) inducible, (tetracycline- 
controlled transactivator (tTA)-responsive promoter (tet system, Paulus, W. et al., "Self- 
Contained, Tetracycline-Regulated Retroviral Vector System for Gene Delivery to 
Mammalian Cells", J of Virology, January 1996, Vol. 70, No. 1, pp. 62-67)), or tissue 
specific, (such as those cited in Costa, et al., European Journal of Biochemistry, 258
"Transcriptional Regulation of the Tissue-Type Plasminogen Activator Gene in Human 
Endothelial Cells: Identification of Nuclear Factors That Recognize Functional Elements 
in the Tissue-Type Plasminogen Activator Gene Promoter" pgs, 123-131 (1998); 
Fleischmann, M., et al., FEBS Letters 440 "Cardiac Specific Expression Of The Green 
Fluorescent Protein During Early Murine Embryonic Development" pgs. 370-376, 
(1998); Fassati, Ariberto, et al., Human Gene Therapy, (9:2459-2468) "Insertion Of Two 
Independent Enhancers In The Long Terminal Repeat Of A Self Inactivating Vector 
Results In High-Titer Retroviral Vectors With Tissue Specific Expression" (1998);
Valerie, Jerome, et al., Human Gene Therapy 9:2653-2659, "Tissue Specific Cell Cycle
Regulated Chimeric Transcription Factors for the Targeting Of Gene Expression to
Tumor Cells", (1998); Takehito, Igarashi, et al., Human Gene Therapy 9:2691-2698, "A
Novel Strategy of Cell Targeting Based on Tissue-Specific Expression of the Ecotropic
Retrovirus Receptor Gene", 1998; Lidberg, Ulf et al., The Journal of Biological Chemistry
273, No. 47, "Transcriptional Regulation of the Human Carboxyl Ester Lipase Gene In
Exocrine Pancreas" 1998; Yu, Geng-Sheng et al., The Journal of Biological Chemistry
273 No. 49, "Co-Regulation of Tissue-Specific Alternative Human Carnitine 
Palmitoyltransferase IB Gene Promoters by Fatty Acid Enzyme Substrate" (1998)). These 
types of sequences are well known in the art and are commercially available through 
several sources, ATCC, Pharmacia, Invitrogen, Stratagene, and Promega. Alternatively, 
the marker delivery vector may not contain an independent transcription unit encoding a 
selectable marker, and selection in this case can be made solely based on the biological 
and phenotypic properties of the assay marker gene used for exon trapping. 

[0140] The assay marker gene to be expressed can then be introduced into the 

vector of the invention. The foreign marker gene DNA typically will comprise a 
promoterless transcription unit. 

[0141] In a most preferred embodiment the vector comprises a specifically 

engineered multi-cloning site within which several unique restriction sites are created. 
Restriction enzymes and their cleavage sites are well known to those of skill in the art. 

[0142] In a preferred embodiment, a packaging cell line is transduced with a viral 

vector containing the marker nucleotide sequence to form a producer cell line including 
the viral vector. The producer cells may then be directly administered, whereby the 
producer cells generate viral particles capable of transducing the recipient cells. 

[0143] In a preferred embodiment, the viral vector is a retroviral vector. 

Examples of retroviral vectors which may be employed include, but are not limited to, 
Moloney Murine Leukemia Virus, spleen necrosis virus, and vectors derived from 
retroviruses such as Rous Sarcoma Virus, Harvey Sarcoma Virus, avian leukosis virus, 
human immunodeficiency virus, myeloproliferative sarcoma virus, and mammary tumor
virus. 

[0144] Retroviral vectors are useful as agents to mediate retroviral-mediated gene 

transfer into eukaryotic cells. Retroviral vectors are generally constructed such that the 
majority of sequences coding for the structural genes of the virus are deleted and replaced 
by the therapeutic gene(s) of interest. Most often, the structural genes (i.e., gag, pol, and 
env), are removed from the retroviral backbone using genetic engineering techniques 
known in the art. This may include digestion with the appropriate restriction 
endonuclease or, in some instances, with Bal 31 exonuclease to generate fragments 
containing appropriate portions of the packaging signal. 

[0145] The marker gene may be incorporated into the proviral backbone in 

several general ways. The most straightforward constructions are ones in which the 
structural genes of the retrovirus are replaced by a single gene which then is transcribed 
under the control of the viral regulatory sequences within the long terminal repeat (LTR). 
Retroviral vectors have also been constructed which can introduce more than one gene 
into target cells. Usually, in such vectors one gene is under the regulatory control of the 
viral LTR, while the second gene is expressed either off a spliced message or is under the 
regulation of its own, internal promoter. However, in the context of the instant invention, 
a promoterless marker gene has to be introduced within the vector backbone, preferably 
in an orientation which is inverted with respect to that of the viral transcription so the 
splicing signals carried by the marker exon do not interfere with the vector splicing 
signals. 

[0146] Efforts have been directed at minimizing the viral component of the viral 

backbone, largely in an effort to reduce the chance for recombination between the vector 
and the packaging-defective helper virus within packaging cells. A packaging-defective 
helper virus is necessary to provide the structural genes of a retrovirus, which have been 
deleted from the vector itself. 

[0147] In one embodiment, the retroviral vector may be one of a series of vectors 

described in Bender, et al., J. Virol. 61:1639-1649 (1987), based on the N2 vector 
(Armentano, et al., J. Virol., 61:1647-1650) containing a series of deletions and 
substitutions to reduce to an absolute minimum the homology between the vector and
packaging systems. These changes have also reduced the likelihood that viral proteins
would be expressed. In the first of these vectors, LNL-XHC, the natural ATG start codon
of gag was altered to TAG by site-directed mutagenesis, thereby eliminating
unintended protein synthesis from that point.

[0148] In Moloney murine leukemia virus (MoMuLV), 5' to the authentic gag 

start, an open reading frame exists which permits expression of another glycosylated 
protein (pPr80 gag). Moloney murine sarcoma virus (MoMuSV) has alterations in this 5'
region, including a frameshift and loss of glycosylation sites, which obviate potential
expression of the amino terminus of pPr80 gag. Therefore, the vector LNL6 was made,
which incorporated both the altered ATG of LNL-XHC and the 5' portion of MoMuSV.
The 5' structure of the LN vector series thus eliminates the possibility of expression of
retroviral reading frames, with the subsequent production of viral antigens in genetically
transduced target cells. In a final alteration to reduce overlap with packaging-defective
helper virus, Miller has eliminated extra env sequences immediately preceding the 3'
LTR in the LN vector (Miller, et al., Biotechniques, 7:980-990, 1989). 

[0149] The paramount need that must be satisfied by any gene transfer system for 

its application to gene therapy is safety. Safety is derived from the combination of vector 
genome structure together with the packaging system that is utilized for production of the 
infectious vector. Miller, et al. have developed the combination of the pPAM3 plasmid 
(the packaging-defective helper genome) for expression of retroviral structural proteins 
together with the LN vector series to make a vector packaging system where the 
generation of recombinant wild-type retrovirus is reduced to a minimum through the 
elimination of nearly all sites of recombination between the vector genome and the 
packaging-defective helper genome (i.e. LN with pPAM3). 

[0150] In one embodiment, the retroviral vector may be a Moloney Murine 

Leukemia Virus of the LN series of vectors, such as those hereinabove mentioned, and 
described further in Bender, et al. (1987) and Miller, et al. (1989). Such vectors have a 
portion of the packaging signal derived from a mouse sarcoma virus, and a mutated gag 
initiation codon. The term "mutated" as used herein means that the gag initiation codon 
has been deleted or altered such that the gag protein or fragment or truncations thereof,
are not expressed. 

[0151] In another embodiment, the retroviral vector may include at least four 

cloning, or restriction enzyme recognition sites, wherein at least two of the sites have an 
average frequency of appearance in eukaryotic genes of less than once in 10,000 base 
pairs; i.e., the restriction product has an average DNA size of at least 10,000 base pairs. 
Preferred cloning sites are selected from the group consisting of NotI, SnaBI, SalI, and
XhoI. In a preferred embodiment, the retroviral vector includes each of these cloning
sites. 
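
The frequency criterion described above can be illustrated with a rough calculation. The following sketch is not part of the disclosed method. It assumes random sequence with equal base frequencies, in which an n-base recognition site is expected about once every 4^n bases, and it uses the commonly published recognition sequences of the four named enzymes. Actual frequencies in eukaryotic genes differ from this naive estimate, in particular for sites containing the CG dinucleotide (all four sites here), which is under-represented in vertebrate genomes.

```python
# Commonly published recognition sequences for the enzymes named in the text.
SITES = {
    "NotI": "GCGGCCGC",
    "SnaBI": "TACGTA",
    "SalI": "GTCGAC",
    "XhoI": "CTCGAG",
}


def expected_spacing(site: str) -> int:
    """Expected number of bases per occurrence in random DNA with equal base
    frequencies (a simplifying assumption, not a measured genomic value)."""
    return 4 ** len(site)


def count_sites(sequence: str, site: str) -> int:
    """Count occurrences of a recognition site in a sequence.

    All four sites above are palindromic, so scanning one strand is enough.
    """
    sequence = sequence.upper()
    width = len(site)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == site
    )


if __name__ == "__main__":
    for name, site in SITES.items():
        print(f"{name} ({site}): about 1 site per {expected_spacing(site)} bases")
```

Under this simplified model only the 8-base NotI site exceeds the 10,000 base pair spacing; the 6-base sites reach it only to the extent that their CG content makes them rarer in real eukaryotic sequence than the model predicts.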

[0152] When a retroviral vector including such cloning sites is employed, there 

may also be provided a shuttle cloning vector which includes at least two cloning sites 
which are compatible with at least two cloning sites selected from the group consisting of 
NotI, SnaBI, SalI, and XhoI located on the retroviral vector. The shuttle cloning vector
also includes at least one desired gene which is capable of being transferred from the 
shuttle cloning vector to the retroviral vector. 

[0153] The shuttle cloning vector may be constructed from a basic "backbone" 

vector or fragment to which are ligated one or more linkers which include cloning or 
restriction enzyme recognition sites. Included in the cloning sites are the compatible, or 
complementary cloning sites hereinabove described. Genes and/or promoters having ends 
corresponding to the restriction sites of the shuttle vector may be ligated into the shuttle 
vector through techniques known in the art. 

[0154] The shuttle cloning vector can be employed to amplify DNA sequences in 

prokaryotic systems. The shuttle cloning vector may be prepared from plasmids generally 
used in prokaryotic systems and in particular in bacteria. Thus, for example, the shuttle 
cloning vector may be derived from plasmids such as pBR322; pUC 18; etc. 

[0155] The vector then is employed to transduce a packaging cell line to form a 

producer cell line. Examples of packaging cells which may be transfected include, but are
not limited to, the PE501, PA317, PSI-2, PSI-AM, PA12, T19-14X, VT-19-17-H2, PSI-CRE,
PSI-CRIP, GP+E-86, GP+envAM12, and DAN cell lines. The vector containing
the therapeutic nucleic acid sequence may transduce the packaging cells through any
means known in the art. Such means include, but are not limited to, electroporation, the
use of liposomes, and calcium phosphate co-precipitation. The producer cells then are 
administered directly to or adjacent to desired recipient cells. 

[0156] In another embodiment, the retroviral vectors can be based on human 

immunodeficiency virus Type I, using backbones for vector and helper packaging 
plasmids as described by Naldini et al., Science 1996, 272: 263-267; Zufferey et al.,
Nature Biotechnology 1997, 15: 871-875; and Reiser et al., Proc. Natl. Acad. Sci. USA 
1996, 93: 15266-15271. Moreover, these vectors can withstand a deletion in the 3' U3 
region of the 3' LTR that turns them into self-inactivating vectors after integration into 
the target genome, without a negative impact in vector titers (Zufferey et al., J of 
Virology 1998, 72: 9873-9880; Miyoshi et al., J. of Virology 1998, 72: 8150-8157). In 
the context of the present invention, the self-inactivating modification would avoid 
transcription of RNA from the viral 5' LTR that would generate an antisense RNA to the
cellular gene being trapped by the assay marker gene delivered by the vector. 

[0157] Integration occurs within the transcribed region of a cellular gene in a 

fashion that renders the production/accumulation of the marker protein dependent upon 
transcription initiated at cellular promoters. 

[0158] In yet another preferred embodiment the polynucleotide vector may 

include a "splice acceptor site" so that if the vector integrates in the proper orientation 
within an intron encoding region of a cellular gene, the marker protein is produced as a 
fusion product with a portion of whatever cellular protein is encoded by the gene where 
the insertion event has occurred (Inclusive but not limited to, e.g., inclusion of an internal 
ribosome entry site (IRES) prior to the start codon of the marker gene ensures it will be 
expressed whenever RNA from the cellular gene (where integration has occurred) is 
transported to the cytoplasm in a form that is translatable). (FIG. 1) 

[0159] Similarly, multiple markers may be included such that one marker protein 

may be expressed as a fusion and a second marker protein may be expressed from an 
IRES (FIG. 2C). Constructions are also possible to acquire different pieces of information 
about integration sites, depending upon the positioning of splice acceptor and donor sites. 
[0160] According to the invention, serial gene-trapping vectors are provided for the acquisition

of the data needed to assign integration sites to specific genes and to mean marker protein
expression levels. Examples of such vectors are shown in FIG. 2A-2K.

[0161] As shown in FIG. 15 the issue of frameshift is also important to consider 

as only 1 in 3 integrants will be functional. Due to the triplet organization of translation 2 
out of 3 integrations will not result in functional assay marker production as they will 
result in frame shifts which will disrupt the translation of the marker gene despite its 
integration into an active region of the cellular genome. Thus in yet another preferred 
embodiment of the invention, a plurality of vectors are constructed which are only one 
base or two bases different from the start site of the marker gene. This will help to trap 
some exons which are not in frame after integration of the marker nucleotide. 
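
The reading-frame argument above can be made concrete with a short sketch. It is a simplified model offered only for illustration: it assumes that the only determinant of a functional fusion is whether the number of upstream coding bases plus the extra bases carried by the vector is a multiple of three, and it ignores stop codons, splicing efficiency and protein stability. The function names are hypothetical.

```python
def in_frame(upstream_coding_bases: int, vector_offset: int) -> bool:
    """True if the marker codons continue the reading frame of the upstream exon.

    upstream_coding_bases: coding bases contributed by the trapped cellular exon(s).
    vector_offset: extra bases (0, 1 or 2) placed before the marker coding sequence.
    """
    return (upstream_coding_bases + vector_offset) % 3 == 0


def rescuing_offset(upstream_coding_bases: int, offsets=(0, 1, 2)):
    """Return the first vector offset that restores the frame, or None."""
    for offset in offsets:
        if in_frame(upstream_coding_bases, offset):
            return offset
    return None


if __name__ == "__main__":
    # With a single vector (offset 0) one third of junction positions are in frame.
    hits = sum(in_frame(k, 0) for k in range(300))
    print(f"single vector: {hits} of 300 positions in frame")
    # With three vectors differing by one or two bases, every position is covered.
    print(all(rescuing_offset(k) is not None for k in range(300)))
```

In this model a set of three vectors whose marker start sites differ by zero, one and two bases covers every possible frame, which is the rationale for constructing the plurality of vectors.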

[0162] The production of these various vectors is readily achieved by those 

skilled in the art. One methodology which may be used for creating the vectors is as
follows. The vectors to be produced as defective retroviruses are transfected into a vector
packaging cell line containing a helper virus (inclusive of, but not limited to, retroviral
AMIZ helper virus, or other retroelements (Young, W. B. and C. J. Link, Jr., Chimeric
retroviral helper virus and picornavirus IRES sequence to eliminate DNA methylation
for improved retroviral packaging cells. J. Virol, 2000. 74(11): p. 5242-9)), which can
prevent the unwanted silencing of helper virus by cellular DNA methylation (Young, W.
B., G. L. Lindberg, and C. J. Link, Jr., DNA methylation of helper virus increases genetic
instability of retroviral vector producer cells. J Virol, 2000. 74(7): p. 3177-87). This
AMIZ helper virus-packaging cell line can produce vector titers up to 2 x 10^7 CFU
(colony forming units)/ml.

[0163] In certain circumstances where the production of retrovirus is limited 

alternative methods of retroviral production can be performed using a chimeric
adenovirus system to produce vector titers up to 5 x 10^9 cfu/ml (Ramsey et al., Caplen et
al.).




Sorting of Cells Based Upon Levels of Marker Peptide Expression 

[0164] Cells which express the marker are then sorted and preferably quantified 

by their level of expression to generate an expression profile for a particular cell type. 
Sorting or separation of the cells can be by any method which provides for the separation 
and preferably quantification based upon expression of the marker sequence. This could
be by fluorescence-activated sorting, mechanical sorting, charge, density, magnetic or
other methods.

[0165] A preferred method of sorting includes the use of flow cytometry. Flow 

cytometry integrates optical, fluidic, and electronic components in
fluorescence-activated cell sorters (FACS) capable of rapid
interrogation of cells containing useful fluorescent marker(s) in real time.

[0166] Markers which may be sorted by this method include cell surface displayed

protein; lipid, lipoprotein, glycolipid, and glycoprotein targets that can be tagged with 
specific fluorescent compounds using labeled antibodies, direct chemical linkage and/or 
combination of direct and indirect tagging. 

[0167] One alternative embodiment includes the use of high-sensitivity/high-density

plate readers to detect chemiluminescent signals (range 1 x 10^-18 M to 1 x 10^-21 M);
or, with concomitant decreased sensitivity, conventional plate reader technology can be
used to measure absorbance of enzyme based chromophores. A method for sorting cells 
with similar speed to that of conventional FACS may be employed where the electrical 
charging plates are replaced with high performance electromagnets that allow magnetic 
based separation. Alternatively, confocal microscopy will allow increased sensitivity but 
with significant reduction in throughput. 

[0168] In a preferred embodiment the assay marker peptide is a naturally 

fluorescent protein fusion product that includes but is not limited to humanized Renilla 
reniformis green fluorescent protein (hrGFP) with FACS separation. Examples of 
uncloned GFP molecules useful for practice of the invention have been cited in Cormier,
M. J., Hori, K., and Anderson, J. M. (1974) Bioluminescence in Coelenterates. Biochim.
Biophys. Acta 346:137-164. In cases where the fluorescent signal of the tagged fusion
proteins is of insufficient magnitude to be useful, the cells may be probed again with
enzyme labeled fluorescence.

[0169] In a further embodiment ELISA and Western blotting may be used to 

establish correlation curves that increase the accuracy of the protein estimations. 
Alternatively, RIA and/or immunoprecipitations can be used to establish standard 
correlation curves of target protein content. Preferably a consistent standard and 
calibrator set of beads will be developed with a known number of molecules of 
fluorescent protein bound per bead. These standard beads will allow correlation of
fluorescence intensity to molecules of equivalent soluble fluorochrome (MESF).
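
One way such a bead-based standard curve could be applied is sketched below. This is an illustrative assumption rather than the calibration procedure of the invention: it fits a straight line in log-log space between measured bead intensity and the known number of fluorescent molecules per bead, then uses the fit to convert a measured cell intensity into a molecule-equivalent value. The bead values in the example are invented for demonstration.

```python
import math


def fit_calibration(bead_molecules, bead_intensity):
    """Least-squares fit of log10(molecules) against log10(intensity).

    Returns (slope, intercept). Assumes at least two beads with distinct,
    positive intensities and positive molecule counts.
    """
    xs = [math.log10(v) for v in bead_intensity]
    ys = [math.log10(v) for v in bead_molecules]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def intensity_to_molecules(intensity, slope, intercept):
    """Convert a measured fluorescence intensity to a molecule-equivalent value."""
    return 10 ** (slope * math.log10(intensity) + intercept)


if __name__ == "__main__":
    # Hypothetical calibrator set: molecules per bead and measured intensity.
    molecules = [1e3, 1e4, 1e5, 1e6]
    intensity = [10, 100, 1000, 10000]
    slope, intercept = fit_calibration(molecules, intensity)
    print(intensity_to_molecules(500, slope, intercept))
```

A log-log fit is chosen here because flow cytometry intensities typically span several decades; a linear fit could be substituted where the dynamic range is narrow.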

[0170] In another optional embodiment the expression construct includes a 

polynucleotide with a negative or positive selection protein for enrichment of the 
population prior to sorting. Use of the negative or positive selection will remove from the 
population all cells with no integration of the polynucleotide, for example via antibiotic 
resistance. This provides for enriched populations of target cells to overcome any relative 
inefficiency of the gene trapping of genomic control elements. Enrichment of gene 
trapped cells will include the use of drug selection (e.g., neo^r, puro^r, hygro^r, zeo^r, HAT^r),
affinity separations including but not limited to Ab/Ag or Ab/hapten, biotin/streptavidin,
glutathione S-transferase (GST) fusion proteins, polyhistidine fusion proteins
(Invitrogen), calmodulin-binding peptide tag (Stratagene), HA epitope tag
(YPYDVPDYA), c-myc epitope tag (peptide seq. EQKLISEEDL) (Stratagene), FLAG
epitope tag (peptide seq. DYKDDDDK) (Stratagene), V5 epitope (Stratagene), the
Linx™ technology {phenyldiboronic acid [PDBA] and salicylhydroxamic acid [SHA]}
(Invitrogen), adhesion, blocking of adhesion, chemotaxis, block of chemotaxis etc.,
and/or enrichment by FACS using fluorescent Ab, fluorescent Ag, fluorescent substrates
or non-fluorescent substrates that become fluorescent after enzymatic cleavage/activation
(a complete listing of common fluorescent probes used for our applications can be found
in references: Shapiro, H. M., Practical Flow Cytometry, Third Edition, Wiley-Liss
(1994); Robinson, J. P., Handbook of Flow Cytometry Methods, Wiley-Liss (1993);
Ormerod, M. G., Flow Cytometry: A Practical Approach, Second Edition, IRL Press
(1994); Robinson, J. P., Current Protocols in Cytometry, John Wiley & Sons (2000)).




[0171] Alternatively, some applications may use depletion of cells that 

demonstrate very high levels of protein expression to allow finer fractionation of cells 
demonstrating lower expression of the marker peptide (e.g., negative selection including 
but not limited to HSV tk/GCV). This negative selection can be applied before or after a 
positive selection process. 

[0172] According to the invention populations of marker peptide (gene trapping) 

cells will be sorted by FACS into various levels of expression based on the distribution of 
number of cells and relative fluorescent intensity. The cells which may be either viable or 
fixed in preservatives (e.g., paraformaldehyde) will then be sorted into groups based on
mean fluorescence intensity. The process is equally efficient with dead, fixed and non-fixed
cells, or cells that have been permeabilized and probed with fluorescently labeled Ab
or enzyme labeled fluorescent probes to increase sensitivity (Cytometry 23, 46 (1996); J
Histochem Cytochem 43, 77 (1995)). The process of sorting will yield either mixed
populations of cell clones displaying similar mean fluorescence intensity or separated 
individual cell clones, each of which will display a distribution of fluorescence values 
which is characteristic of the particular gene trapping event and the particular fusion 
protein generated between the assay marker gene and the endogenous cellular protein 
being measured. An example in which normal (HMEC) and cancer (MCF7) cell populations
were transduced with different vectors (HSG and GTFSO) and sorted into four different
fractions displaying different mean fluorescence levels is shown in FIGS. 14A-14D.
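
The fractionation by fluorescence level described above can be sketched in software terms as follows. This is only an illustration of the gating logic, not a description of any cell sorter's actual control software; the gate boundaries and intensity values are hypothetical, and a real instrument gates on live event streams rather than on a list.

```python
from bisect import bisect_right
from statistics import mean


def sort_into_fractions(intensities, boundaries):
    """Distribute per-cell intensities into len(boundaries) + 1 fractions.

    A cell whose intensity equals a boundary is placed in the higher fraction.
    """
    boundaries = sorted(boundaries)
    fractions = [[] for _ in range(len(boundaries) + 1)]
    for value in intensities:
        fractions[bisect_right(boundaries, value)].append(value)
    return fractions


def fraction_means(fractions):
    """Mean fluorescence intensity of each fraction (None for an empty one)."""
    return [mean(f) if f else None for f in fractions]


if __name__ == "__main__":
    # Hypothetical intensities and three gate boundaries giving four fractions.
    cells = [5, 12, 80, 150, 900, 1200, 20000]
    gates = [10, 100, 1000]
    fractions = sort_into_fractions(cells, gates)
    print(fractions)
    print(fraction_means(fractions))
```

Three boundaries yield four fractions, matching the four-fraction sort referred to in the text; the per-fraction mean is the value later assigned to integration sites recovered from that fraction.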

[0173] Once the sorting process is completed DNA, RNA, and/or total protein 

may be extracted and subjected to downstream amplification and/or analysis (although
live cells could be returned to culture for further amplification if it is deemed useful to do 
so). 

Sequence Tag Acquisition and Reporting 

[0174] Once cells are separated out according to their fusion protein expression 

levels, the fusion protein is associated with particular genomic loci by defining the 
flanking sequences around the integration site of the marker peptide expressing retroviral 
vectors (e.g., molecular DNA bar code). 
[0175] Fractionation of the entire population of cells (with measurable levels of

marker protein expression) into subpopulations of cells (each subpopulation comprised of 
cells with similar levels of marker protein expression) is followed by analysis of 
integration sites within the cells of a subpopulation. As integration sites are identified 
(and correlated with a genetic locus) they are assigned the mean subpopulation marker 
protein level as a measure of relative expression. When complete, all analyzed integration 
sites/genes will have a relative level of protein expression assigned to them. 
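
The bookkeeping implied by this paragraph can be sketched as a simple mapping from genetic locus to the mean marker level of the subpopulation in which its integration site was found. The data layout and locus names below are hypothetical. Where the same locus is recovered from more than one subpopulation, the sketch averages the subpopulation means; that handling is an assumption made for illustration and is not specified in the text.

```python
from collections import defaultdict
from statistics import mean


def assign_expression(subpopulations):
    """Map each locus to a relative expression level.

    subpopulations: iterable of dicts with keys
        "mean_marker_level": mean marker protein level of the sorted fraction
        "loci": loci whose integration sites were identified in that fraction
    """
    levels = defaultdict(list)
    for sub in subpopulations:
        for locus in sub["loci"]:
            levels[locus].append(sub["mean_marker_level"])
    # Assumption: a locus seen in several fractions receives the average level.
    return {locus: mean(values) for locus, values in levels.items()}


if __name__ == "__main__":
    profile = assign_expression([
        {"mean_marker_level": 50.0, "loci": ["locus_A", "locus_B"]},
        {"mean_marker_level": 500.0, "loci": ["locus_C"]},
        {"mean_marker_level": 5000.0, "loci": ["locus_D", "locus_B"]},
    ])
    for locus, level in sorted(profile.items()):
        print(locus, level)
```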

[0176] For the purpose of this example, we describe a method for the acquisition 

of the data needed to assign integration sites to specific genes and to mean marker protein 
expression levels. It is inclusive of, but not limited to, the examples listed below: 

A. Inverse PCR procedure for recovering genomic tags associated with vector or viral
integration events

[0177] As shown in Figure 17, genetic material is first recovered from the cells to be

analyzed, in this example cellular DNA (inclusive of, but not limited to, cellular DNA,
since complementary DNA (cDNA) derived from cellular RNA may be used), the
composition of which is partially known to the operator by virtue of the inclusion of the
sequences encoding the marker peptide. The genetic locus containing the inserted 
sequence (or producing the RNA containing inserted marker gene sequences) is known as 
the "tagged gene." 

[0178] The cellular DNA is then cleaved such that the inserted DNA (with

sequence known to the operator) is cleaved once and the flanking cellular DNA of unknown
sequence is cleaved again in the regions contiguous to the inserted piece of DNA.
Cleavage of the DNA occurs in a fashion generating ends that permit the circularization 
of DNA fragments producing a molecule with the sequence known to the operator 
flanking both sides, and continuous with, a variable length of cellular DNA of unknown 
sequence. 

[0179] Primers comprised of sequences drawn from that sequence known to the 

operator (part of the expression vector) are used in the amplification of unknown 
sequences, in this example by polymerase chain reaction (inclusive of, but not limited to,
PCR, since other means of amplifying sequences as RNA molecules could be used). These
primers are selected to bind to the circularized product described previously, in the 
regions of DNA whose sequence is known to the operator, and prime the synthesis of 
DNA proceeding in opposing directions causing amplification of the DNA segment of 
unknown sequence. The product of this reaction will thus contain two terminal segments 
of DNA sequence known to the operator (as described supra), and an internal DNA
segment of unknown sequence. This amplified DNA molecule is known as the captured 
amplimer. 
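
The cleavage, circularization and outward-primed amplification steps described above can be modeled on plain strings, as in the sketch below. It is a toy illustration under stated assumptions: the sequences are invented, digestion is modeled as a blunt cut immediately before each occurrence of the recognition site (real enzymes leave defined overhangs), ligation is modeled by treating the fragment as a circle, and only the top strand is tracked. The function names are hypothetical.

```python
def digest(sequence, site):
    """Cut immediately before every occurrence of the site (simplified, blunt)."""
    cuts = [i for i in range(1, len(sequence)) if sequence.startswith(site, i)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def fragment_with(fragments, *known):
    """Return the first fragment containing all of the known sequences."""
    for fragment in fragments:
        if all(k in fragment for k in known):
            return fragment
    return None


def inverse_pcr(fragment, fwd_primer, rev_region):
    """Amplify around a self-ligated (circular) fragment with outward primers.

    fwd_primer: known top-strand sequence priming toward the unknown flank.
    rev_region: known top-strand sequence where the reverse primer anneals
                (the primer itself would be its reverse complement).
    Returns the top strand of the product, or None if the primers do not fit.
    """
    f = fragment.find(fwd_primer)
    r = fragment.find(rev_region)
    if f < 0 or r < 0 or r + len(rev_region) > f:
        return None
    # Read from the forward primer to the fragment end, pass the ligation
    # junction, and continue to the end of the reverse primer binding region.
    return fragment[f:] + fragment[: r + len(rev_region)]


if __name__ == "__main__":
    site = "GAATTC"  # EcoRI recognition sequence, used here only as an example
    vector = "CACACA" + site + "TGTGACCA" + "GGTTGGTT"  # invented marker vector
    genome = "ATATATAT" + vector + "AGCTAGCTAACC" + site + "TTTT"
    circle = fragment_with(digest(genome, site), "TGTGACCA", "GGTTGGTT")
    print(inverse_pcr(circle, "GGTTGGTT", "TGTGACCA"))
```

In the printed product the two ends are vector sequence known to the operator and the internal stretch (AGCTAGCTAACC in this toy genome) is the previously unknown flanking cellular DNA, which corresponds to the captured amplimer.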

[0180] The captured amplimer is analyzed for the nucleotide composition of the 

region whose sequence is unknown to the operator. This may be achieved by any of 
several methods known to those skilled in the art. Importantly for the invention, the 
sequence composition is not required to be determined in its entirety; rather, a segment
adequate to allow identification of its origins by comparison to a sequence database of
known composition, e.g. GENBANK, is sufficient.

[0181] The region of the captured amplimer that provides the sequence for

comparison is known as the captured sequence. Comparison of the captured sequence to a 
database can be performed by any of several means known to operators skilled in the art, 
in this example using BLAST analysis. That portion of the captured sequence that can be 
matched to the sequence of genetic loci contained in the established database is referred 
to as the sequence tag. 

[0182] The sequence tag, once acquired and annotated with the corresponding 

genetic locus information and assigned a mean marker protein expression value, then 
may be used to correlate with a reference cell type. This identifies a potential drug target 
or a diagnostic indicator of a disease state or other diagnosable difference between the 
two cell types. 



B. 5' Serial Analysis of Viral Integration (5'SAVI), 3' Serial Analysis of Viral
Integration (3'SAVI) and Serial Analysis of Viral Integration (SAVI). Under the
present invention, three new methods for identifying loci of the marker integration
may be utilized

[0183] The first method is called 5' Serial Analysis of Viral Integration (5'SAVI). 

This method is used to identify the sequence of cellular exons fused by the splicing 
mechanism to the 5' side of the marker gene, which can result in an encoded fusion
protein. This method takes advantage of a Type IIS restriction enzyme site incorporated at
the 5' end of the marker, immediately downstream to the SA signal. Two different 
versions of this method are outlined in Figures 18A-18D. 

[0184] FIG. 18A-18B is a schematic depicting one of the possible experimental 

procedures to carry out the 5'SAVI method. This method starts with purified spliced 
RNA. Reverse transcription is employed to convert RNA transcripts into
double-stranded complementary DNA (cDNA) following standard methods for full
length total cDNA synthesis. This cDNA is then subjected to restriction enzyme
digestion with a Type IIS restriction enzyme, which will cut the cDNA within the sequences
corresponding to the cellular exon ten to twenty bases away from the SD/SA junction,
depending on which Type IIS restriction enzyme is used. A biotin-labeled primer #1 
with a sequence specific for the marker exon is then employed to generate a ssDNA 
fragment that extends into the cellular exon fused upstream of the marker exon. 
Collection of this biotin-ssDNA by streptavidin conjugated magnetic beads enriches these 
specific ssDNA for subsequent DNA terminal transferase reaction. Poly-deoxynucleotide 
can be added onto these ssDNA as a tail at their 3' end. An oligonucleotide primer 
complementary to the homopolymeric tail and a second primer #2 nested with respect to 
primer #1 on the marker gene can therefore be used to amplify by PCR this 3' end of the 
cellular exon fused to the 5 'side of the marker exon. These short tags or amplification 
fragments from different integrated genes can, by ligation reactions, be made into longer 
DNA fragments that are subsequently sequenced. Sequencing results of these tags can be 
used to retrieve the identity from a sequence database. 

[0185] Figure 18C-18D is a schematic depicting another possible experimental 

approach to carry out the 5'SAVI method. In this version of the 5'SAVI method, a
biotinylated primer #1 specific to the complementary sequence of the marker exon is
used to prime cDNA synthesis by reverse transcriptase. Next, a polynucleotide tail is
added to this single stranded cDNA by the enzyme terminal transferase. An 
oligonucleotide primer complementary to this homopolymeric tail is then used to drive 
the synthesis of the complementary second DNA strand. Double stranded products are 
subjected to a type IIS restriction enzyme digestion and the digestion products are 
purified with magnetic streptavidin beads. An adaptor is ligated to the end generated by 
the type IIS restriction enzyme and the products are amplified by PCR with a primer 
corresponding to the adaptor sequence and with a primer #2 specific to the marker gene 
and nested with respect to primer #1. Amplification products are ligated together into 
high order polymeric structures, cloned into sequencing vectors and sequenced. 
Sequencing results of these tags can be used to retrieve the identity from a sequence 
database. 

[0186] The methods described above as 5'SAVI can also be applied to retrieve the
DNA sequence of cellular exons fused by the splicing mechanism to the 3' end of the
marker exon sequence. In this case, the marker exon contains a type IIS restriction enzyme
site at the 3' end of the exon marker, immediately followed by a splice donor consensus
sequence. Figure 18E-18F depicts the method of 3'SAVI. Upon RNA transcription
driven by a cellular promoter or by a heterologous promoter driving the expression of the
marker gene, and RNA splicing, the exon marker sequence will generate a fusion to a
downstream cellular exon. After total double stranded cDNA synthesis and digestion with
a type IIS restriction enzyme, a double stranded DNA adaptor can be ligated to the end
generated by the type IIS restriction enzyme. Amplification of fragments containing
marker sequences can be accomplished by PCR using a primer #1 specific
to the marker exon sequence and a second primer corresponding to the ligated adaptor.
After PCR amplification, fragments of equal length can be ligated into high order
polymeric structures, cloned into sequencing vectors and sequenced.

[0187] Alternatively, the marker gene may contain two recognition sites for a
Type IIS restriction endonuclease as shown in FIG. 18G-18H. In this method, the
construct that is inserted into the genomic DNA of a cell comprises a marker exon
flanked by splice acceptor and donor consensus sequences, a first Type IIS restriction
enzyme recognition (RER) site located upstream of the marker, a second Type IIS RER
site located downstream of the marker, a first non-Type IIS RER site located between the
marker and the first Type IIS RER site, and a second non-Type IIS RER site located between
the marker and the second Type IIS RER site. During splicing, assuming that the
construct has integrated into an intron in the same orientation as the direction of 
transcription of the cellular gene, the introns will be removed by the splicing mechanism 
generating a spliced RNA molecule with an insertion of the marker exon between two 
cellular exons. Then, mRNA is isolated from the cell, and reverse transcribed into full 
length double stranded cDNA. The cDNA is subjected to a Type IIS restriction enzyme 
digestion that recognizes the first and second Type IIS RER sites and thereupon cleaves 
the cDNA upstream of the first Type IIS RER site and downstream of the second Type 
IIS RER site such that a cDNA fragment is produced comprising the marker, and portions 
of the upstream and downstream exon sequences. After digestion with the appropriate 
Type IIS RE, the fragment is self-ligated generating a circular molecule, where the 
sequence tags from the upstream and downstream cellular exons are fused in inverse 
orientation generating a di-tag. This di-tag is then amplified by inverse PCR using 
marker-specific primers. Following amplification, the fragments are digested with one or
more non-Type IIS REs that recognize the first and second non-Type IIS RER sites and
thereupon cleave the fragments within each of the first and the second non-Type IIS
RER sites such that the marker is cleaved away from the fragments. Following non-Type
IIS RE digestion, the di-tag fragments are separated and ligated together to form a
concatamer, which is then sequenced by appropriate methods. The sequence is then
compared to a sequence database such that the protein encoded by the sequence is
identified. The structure of vectors described here for use with the SAVI method can also
be used to retrieve sequence information from only the upstream or downstream exons by
following the above described methods of 5'SAVI or 3'SAVI, respectively. As can be
appreciated by one of ordinary skill in the art, this method allows for the quantitative
elucidation of the protein profile for a given cell coupled with the simultaneous identification
of exon boundaries for all gene trapped genes. Since the length of the di-tags
corresponding to the upstream and downstream exon boundaries captured by this method 
is the same for all genes, PCR amplification does not introduce a size bias and the 
frequency of a di-tag being amplified and sequenced will therefore reflect the relative
abundances of different mRNA transcripts. Therefore, the relative frequencies of di-tags
being sequenced can represent the relative levels of transcription and mRNA abundance
for all identified genes. The combination of different exon boundaries in the di-tags
from the same gene will provide information about alternative splicing of any given
gene.
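
By way of non-limiting illustration, the use of di-tag frequencies as a measure of relative abundance can be sketched in Python. The sketch assumes that the di-tags have already been extracted from the sequenced concatamers as strings; the function name and the example di-tag sequences are hypothetical and are not part of the described method.

```python
from collections import Counter


def ditag_relative_frequencies(ditags):
    """Return the fraction of sequenced di-tags accounted for by each distinct di-tag."""
    counts = Counter(ditags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ditag: count / total for ditag, count in counts.items()}


# Hypothetical di-tags (upstream tag joined to downstream tag), for illustration only.
example_ditags = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "GGATCCAA"]
example_frequencies = ditag_relative_frequencies(example_ditags)
```

Because all di-tags have the same length, no length normalization is applied in this sketch; different di-tags that map to the same gene would still need to be grouped after database matching.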

[0188] The SAVI approach, as described above, has several advantages compared
to other methods. First, it yields the sequence of about 8-10 viral integration
events per sequencing reaction of contiguous exons, thereby lowering sequencing
costs. Second, it provides the sequence information of contiguous exons, thereby allowing
the study of alternative splicing and the annotation of functional exon boundaries in the genome.

[0189] The sequence obtained is analyzed by segmentation into defined lengths
(established by the specificities of the enzyme used earlier), each of which is known as a
"captured viral integration sequence." Comparison of the captured sequence to the database can be
performed by any of several means known to operators skilled in the art, in this example
using BLAST analysis. That portion of the captured sequence that can be matched to the
sequence of genetic loci contained in the established database is referred to as the
"captured SAVI sequence tag."

[0190] The "captured SAVI sequence tag" is then annotated with the genetic 

locus and the mean marker protein expression value and is denoted the "SAVI sequence 
tag." This information can be used as described earlier. 
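
By way of non-limiting illustration, the segmentation of a sequenced concatamer into captured sequences of defined length can be sketched in Python. The sketch assumes that the tags are joined end to end without spacer sequence and that the tag length fixed by the enzyme is known in advance; both are simplifying assumptions, and the function name is hypothetical.

```python
def segment_concatamer(read, tag_length):
    """Split a concatamer read into consecutive tags of tag_length bases.

    A trailing partial tag shorter than tag_length is discarded.
    """
    if tag_length <= 0:
        raise ValueError("tag_length must be positive")
    usable = len(read) - (len(read) % tag_length)
    return [read[start:start + tag_length] for start in range(0, usable, tag_length)]


# Hypothetical read, for illustration only.
example_tags = segment_concatamer("AAAACCCCGGGGTT", 4)
```

Each resulting segment would then be compared to the database and annotated as described above.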

[0191] The application of these technologies yields several important types of 

information. First, the ability to generate "captured amplimers" (from both genomic DNA 
and from cDNA) containing sequences of host cell DNA adjacent to the integrated virus 
provides data that may determine whether the carcinogenic effects of the virus are due to 
insertional mutagenesis, or more likely, to the expression of viral genes. This information 
may be particularly relevant in establishing drug regimens to block expression of the viral 
genes or to block specific changes in cellular gene expression resulting from site-specific 
HPV integration. 

[0192] Perhaps more importantly, the ability to quantify the frequencies at which 

particular sites of viral integration have occurred can provide information on the clonality 






of analyzed lesions, potentially even from samples such as Pap smears. This is important 
because the presence of clonal populations of cells expressing the integrated viral 
transforming genes should correlate with the development of cancer in humans (as is seen 
in rodent models). 

[0193] Similarly, bronchoalveolar carcinoma in humans shows characteristics 

strikingly similar to tumors induced in sheep by a viral pathogen (Jaagsiekte virus) 
although no etiologic agent has yet been identified. The application of the methods of the 
invention in combination with preparation of EST libraries from bronchoalveolar 
carcinomas and surrounding normal tissues from patients with this disease could provide 
information on the causative agent in humans, providing diagnostic/prognostic markers
that can be exploited. Besides these clinical applications, these data acquisition and
reporting systems can also be used to study the mechanism of alternative splicing and
genes whose expression is regulated by alternative splicing. The transcriptional levels of
genes can also be digitized and represented by the frequency with which genes are captured.
The products of these captured gene tags will be used as probes to hybridize to a DNA
microarray for data validation.

Analysis of Data Obtained From Sequencing of Captured Gene Tags 

[0194] In an optimal embodiment the sequence tags and their associated 

fluorescence/mRNA levels will be used as input. This data will be analyzed 
concomitantly with publicly and privately available data. 

[0195] The resultant data can be imported into a proprietary database, or mined 

directly for quick comparison and pattern matching. This activity will result in a wide 
variety of information including but not limited to pharmacogenetic targets, pathway and 
metabolic analysis, comparison of protein expression between and within species, 
organisms, and cell states. 

[0196] For the purpose of this application, the term genetic locus is used to 

specify a particular location within the context of the genome and does not imply a 
complete transcription or regulatory unit, instead referring to a specific sequence which 
may comprise all or part of such functional units or sites.

[0197] For the purpose of this application, marker protein concentration refers to

the concentration of specific individual protein configurations which result from 
phosphorylation, acetylation or other structural modifications which affect functional 
state (e.g., dimerization vs. monomer) in addition to any assumed unmodified peptide 
arising from the translation of a mRNA. 

Data Aggregation Process: 

[0200] The preferred embodiment of the aggregation process consists of five
steps. These steps are: 1) matching the tag against its respective protein sequence, 2)
associating a concentration or count level with the tag that is derived from data measured
in the FACS module, 3) combining all of the available tag protein expression level data
obtained from independent experiments performed with a certain cell type or line to
obtain a statistical distribution of the protein expression profile for either each particular
fusion protein or a composite value for each genetic locus that summarizes the protein
expression data of all characterized fusion proteins between the marker gene and the
different exons of the protein encoded by said genetic locus, 4) creating tables that
represent the information for each tag and the composite information for each genetic
locus, and 5) statistically evaluating differences between the distributions of protein
expression levels associated with each tag obtained from the reference and test populations
of cells. Steps one and two are order independent: step two can occur
before step one without affecting the process.

[0201] Implementation of step one begins with the receipt of sequence tags (a
variable length DNA sequence that will usually be between 16 and 25 bases) and associated
marker protein concentration data (in this example, but not limited to, FACS-derived
data). Each tag is compared with a database that contains sequence information of the
proteins in the organism of interest. Many methods of making this comparison are
possible. Potential methods include but are not limited to: hashing algorithms, dynamic
programming alignment algorithms (such as the Smith-Waterman and Needleman-Wunsch
alignment algorithms), suffix trees and arrays, inverted lists, and combined approaches
(such as BLAST (combination of hashing and dynamic programming)) as well as any
other string alignment algorithm.

[0202] The database can consist of annotated or unannotated genomic sequences
that find expression in cells as RNA (independent of their translation into protein, e.g.,
snRNA, scRNAs, RNAs with catalytic activities, etc.), cDNA libraries, EST libraries, 
protein sequence libraries (including DNA sequences (with or without intronic or exonic 
sequences) and amino-acid sequences (including primary, secondary and/or tertiary 
structure information)). Examples of such databases would include the publicly available 
EST and genomic databases. The end result of the matching step is that every tag 
becomes associated with a genetic unit (including subdivisions thereof such as specific 
intron or exon within a transcription unit) or becomes marked as an unknown so that it
can be run again as more information about the proteome/transcriptome becomes known.
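
By way of non-limiting illustration, the hashing approach to the matching step can be sketched in Python. The sketch assumes exact matching of each tag against nucleotide reference sequences held in memory; the function names, the seed length of 8 bases, and the locus names and sequences are hypothetical. A tag that matches nothing is returned as unknown so that it can be run again later.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map every k-base substring of each reference sequence to the loci containing it."""
    index = defaultdict(set)
    for locus, sequence in references.items():
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]].add(locus)
    return index


def match_tag(tag, references, index, k):
    """Return the sorted loci whose sequence contains the tag exactly, or ["unknown"]."""
    if len(tag) < k:
        return ["unknown"]
    # The hash lookup on the first k bases narrows the candidates cheaply;
    # the substring test then confirms the full-length exact match.
    candidates = index.get(tag[:k], set())
    hits = sorted(locus for locus in candidates if tag in references[locus])
    return hits or ["unknown"]


# Hypothetical reference sequences, for illustration only.
example_references = {
    "locusA": "ATGGCCATTGTAATGGGCCGCTGAAAGG",
    "locusB": "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
example_index = build_kmer_index(example_references, k=8)
```

A production comparison would more likely tolerate mismatches, for example by using BLAST or a dynamic programming alignment as listed above; exact matching is used here only to keep the sketch short.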

[0203] In step two, each tag has data associated with it that can be used to derive 

quantitative values of the corresponding protein expression levels. The derivation of this 
quantitative value consists of applying a formula (derived early in the process) to the raw 
fluorescence levels. 

[0204] Step three consists of taking the individual tag marker protein
concentration levels for each individual genetic locus and combining them to create
composite value(s) for the genetic locus and those closely related loci (e.g., other introns
or exons) within a transcription unit. (This value may have various statistical data
associated with it including, but not limited to, measures of central tendency and variance.)
The derived data represent a statistical profile of the protein concentration dependent
upon the properties of the cells being surveyed (e.g., marker protein stability, transport,
cellular autofluorescence, etc.). The derivation process will be optimized for each
organism. This optimization may incorporate a wide variety of methods including, but
not limited to, comparison of the individual tag's marker protein concentration (in this
example FACS-derived, but not limited thereto, e.g., ferrous conjugate and electromagnetically
fractionated) with protein levels as measured by other empirical methods (e.g., ELISA,
NMR, 2-D gel electrophoresis) and the use of general biological knowledge about protein
structure and regulation. These different methods will allow the determination of the
accuracy of tags from different regions of proteins on a proteome-wide level as well as at a
more detailed level (for example protein families, super families, proteins with any
significant homology, and individual proteins). For those transcription units that produce
no translation products, marker protein concentration values of zero can be assigned.
Although these genetic loci will not be useful for the determination of direct
protein/genetic locus correlation, integrative studies can determine whether expression of
these transcripts correlates with changes in the pattern of expression at other loci and/or
participates in more global regulatory phenomena (such as alterations in the selection of
alternative splicing sites, polyadenylation sites, cytoplasmic transport/stability properties,
ribosome binding and other translational events, etc.).
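
By way of non-limiting illustration, the combination of individual tag values into a composite value per genetic locus can be sketched in Python. The sketch assumes that each tag has already been matched to a locus and assigned an expression level; the mean and population variance are chosen here only as examples of the statistical data mentioned above, and the function name, locus names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def composite_by_locus(tag_records):
    """Summarize per-tag expression levels as one composite record per genetic locus.

    tag_records is an iterable of (tag, locus, level) tuples.
    """
    levels_by_locus = defaultdict(list)
    for _tag, locus, level in tag_records:
        levels_by_locus[locus].append(level)
    return {
        locus: {
            "n_tags": len(levels),
            "mean": mean(levels),
            "variance": pvariance(levels),
        }
        for locus, levels in levels_by_locus.items()
    }


# Hypothetical tag records, for illustration only.
example_records = [
    ("ACGTACGTACGTACGT", "locusA", 10.0),
    ("TTGACCATGGCATTGA", "locusA", 20.0),
    ("GGCCTTAAGGCCTTAA", "locusB", 5.0),
]
example_composites = composite_by_locus(example_records)
```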

[0205] Step four consists of placing all of the input and consequently derived 

information into tables that are suitable for further detailed analysis or loading into a 
database. The resulting tables will be relational to support the use of analysis tools 
including, but not limited to, those found in standard OLAP applications, industrial 
engineering, operations research, artificial intelligence, forecasting techniques, clustering, 
genetic network inference and pathway analysis. Examples of such data analysis 
techniques include but are not limited to phylogenetic tree construction, k-means 
clustering, expectation maximization, self-organizing maps, support vector machines, 
various public-domain algorithms, as well as mathematical/statistical models like 
Boolean networks, applications of differential equations, and stochastic and hybrid Petri 
nets. 
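
By way of non-limiting illustration, the relational tables of step four can be sketched in Python using the standard `sqlite3` module. The table layout, column names and example rows are hypothetical and are chosen only to show one table per genetic locus related to one table of per-tag measurements; they are not the schema of any particular embodiment.

```python
import sqlite3


def build_expression_tables(tag_rows):
    """Load (tag_seq, locus_id, cell_line, expression) rows into two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locus (locus_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE tag ("
        " tag_seq TEXT NOT NULL,"
        " locus_id TEXT NOT NULL REFERENCES locus(locus_id),"
        " cell_line TEXT NOT NULL,"
        " expression REAL NOT NULL)"
    )
    for tag_seq, locus_id, cell_line, expression in tag_rows:
        conn.execute("INSERT OR IGNORE INTO locus (locus_id) VALUES (?)", (locus_id,))
        conn.execute(
            "INSERT INTO tag (tag_seq, locus_id, cell_line, expression) VALUES (?, ?, ?, ?)",
            (tag_seq, locus_id, cell_line, expression),
        )
    conn.commit()
    return conn


def mean_expression_by_locus(conn):
    """Return (locus_id, mean expression) pairs, ordered by locus."""
    query = "SELECT locus_id, AVG(expression) FROM tag GROUP BY locus_id ORDER BY locus_id"
    return conn.execute(query).fetchall()


# Hypothetical rows, for illustration only.
example_rows = [
    ("ACGTACGTACGTACGT", "locusA", "test", 10.0),
    ("TTGACCATGGCATTGA", "locusA", "test", 20.0),
    ("GGCCTTAAGGCCTTAA", "locusB", "reference", 5.0),
]
example_conn = build_expression_tables(example_rows)
```

Keeping per-tag rows separate from per-locus rows allows either level to be queried or exported to downstream clustering and pathway analysis tools.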

[0206] Step five consists of the evaluation of statistical differences between
reference and test cell lines. For example, to identify candidate genes involved in cancer
biology, the method is performed for two cell types: a test and a reference cell type,
such as a normal cell and cancer cells or cell lines from the same tissue. The method of
the invention enables quantification of a single protein species per cell in a population of
cells from a given cell line. These cells are then classified into several gates of
different protein expression levels. Each experiment is repeated several times, yielding
independent protein expression level measurements for each tagged protein. Thus,
compilation of all these data will yield a distribution of discrete independent values for
each particular fusion protein over several discrete gates for each cell line that was
analyzed. Based on the analysis of these distributions as described below, it is possible to
rank genes by P-value. One possible, non-limiting way to do this is to perform a
randomization test for each gene, where the null hypothesis is that "there is no difference
between the protein level distribution of the gene in the test and reference cell lines". A
statistic used to measure these differences is:

$$t = \sum_{i=1}^{G} \left(\alpha^{c}_{ij} - \alpha^{n}_{ij}\right) X_i, \qquad \text{where } \alpha^{c}_{ij} = n^{c}_{ij}/N^{c}_{j} \text{ and } \alpha^{n}_{ij} = n^{n}_{ij}/N^{n}_{j}$$

where

$X_i$ = representative fluorescence value for gate $i$, $i \in \{1, 2, 3, 4, \ldots, G\}$
$n^{c}_{ij}$ = number of independent events observed in gate $i$ for gene $j$ in the cancer cells or cell line
$n^{n}_{ij}$ = number of independent events observed in gate $i$ for gene $j$ in the normal cells or cell line
$N_{ij} = n^{c}_{ij} + n^{n}_{ij}$ = total independent events observed in gate $i$ for gene $j$ over both cells or cell lines
$N^{c}_{j} = \sum_{i=1}^{G} n^{c}_{ij}$ = total number of independent events in the cancer cells or cell line over all gates
$N^{n}_{j} = \sum_{i=1}^{G} n^{n}_{ij}$ = total number of independent events in the normal cell line over all gates

[0207] Under the null hypothesis, it is possible to assign a P-value to the observed
value of the statistic using a randomization procedure. We randomly choose $N^{c}_{j}$ events
from the pool of $N_{j}$ ($= N^{c}_{j} + N^{n}_{j}$) total events and consider them to be arising from the test
cell line and the rest from the reference cell line. For each such random assignment, the
test statistic is calculated and recorded. This procedure is repeated for a large number of
iterations (1000-10000) to get a distribution of the test statistic. The P-value of the
observed test statistic can be easily calculated as the ratio of the number of points equal to
or more extreme than the observed value to the total number of iterations performed.
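
By way of non-limiting illustration, the randomization procedure can be sketched in Python for a single gene. The sketch makes several assumptions that are not stated in the text: the statistic is taken to be the sum over gates of the difference between the per-gate event fractions of the test and reference cell lines, weighted by the representative fluorescence value of each gate; "more extreme" is interpreted two-sidedly, by absolute value; and the function names, seed and example counts are hypothetical.

```python
import random


def gate_statistic(test_counts, ref_counts, gate_values):
    """Gate-weighted difference between the per-gate event fractions of two cell lines."""
    n_test = sum(test_counts)
    n_ref = sum(ref_counts)
    if n_test == 0 or n_ref == 0:
        raise ValueError("each cell line needs at least one event")
    return sum(
        (c / n_test - r / n_ref) * x
        for c, r, x in zip(test_counts, ref_counts, gate_values)
    )


def randomization_p_value(test_counts, ref_counts, gate_values, iterations=1000, seed=0):
    """Return (observed statistic, P-value) from random reassignment of pooled events."""
    rng = random.Random(seed)
    n_gates = len(gate_values)
    observed = gate_statistic(test_counts, ref_counts, gate_values)
    totals = [test_counts[i] + ref_counts[i] for i in range(n_gates)]
    # One entry per pooled event, recording the gate in which it was observed.
    pool = [gate for gate in range(n_gates) for _ in range(totals[gate])]
    n_test = sum(test_counts)
    extreme = 0
    for _ in range(iterations):
        rng.shuffle(pool)
        perm_test = [0] * n_gates
        for gate in pool[:n_test]:  # events treated as arising from the test cell line
            perm_test[gate] += 1
        perm_ref = [totals[i] - perm_test[i] for i in range(n_gates)]
        if abs(gate_statistic(perm_test, perm_ref, gate_values)) >= abs(observed):
            extreme += 1
    return observed, extreme / iterations


# Hypothetical gate values and event counts, for illustration only.
example_gates = [1.0, 10.0, 100.0, 1000.0]
same_stat, same_p = randomization_p_value([5, 5, 5, 5], [5, 5, 5, 5], example_gates)
diff_stat, diff_p = randomization_p_value([0, 0, 0, 20], [20, 0, 0, 0], example_gates)
```

Genes can then be ranked by the returned P-value; a one-sided comparison could be substituted if only increases or only decreases in expression are of interest.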

[0208] In the case where the distribution of protein expression levels is obtained 

from cloned cells where each clone bears a different exon trapping event, standard 
statistical methods can be used to evaluate differences between protein expression 
profiles for a particular fusion protein between the reference and test cell lines, obtained 
from the fluorescence intensity distribution measured by FACS analysis of the 
monoclonal population of cells bearing the same protein fusion.

Screen for Proteins and/or Protein Domains That Interact With a Protein of Interest
By Using Methods of The Instant Invention

[0209] The original process described herein and in the applications from which 

the present application claims priority consists in the comparison of protein fusion 
profiles between two different cell lines or cell lines subjected to different treatments or 
culture conditions. The protein fusions are obtained by gene trapping of cellular exons by 
insertion of retroviruses expressing a reporter gene (hrGFP, AcGFP, epitopes, etc) 
defined as an exon by functional flanking splicing signals, within introns of functional 
transcriptional units. The expression profile of the different protein fusions is determined 
by cell sorting of cells into different fractions based on their level of fluorescence and the 
identity of each gene trapping event is determined by sequencing the exon flanking the 
reporter exon. Each gene trapping event generates an in-frame fusion with the fluorescent 
reporter protein, resulting in a measurable signal. 

[0210] This process can be modified to include functional restrictions on the
generation of the fluorescent signal. This means that the generation of a positive
fluorescent signal will require additional mechanisms or events in addition to an in-frame
gene trapping with the reporter protein.

[0211] For example, the system is designed so as to require that the fusion protein
interact with other proteins or protein domains in order to generate the fluorescent signal.
This can be accomplished by having a pair of reporter proteins or protein fragments that
reconstitute some type of activity when brought together by protein interacting partners
fused to each of these reporter subunits. A condition that must be met is that the
interacting subunits of the reporter system do not drive the protein interaction but that the
fusion protein domains fused to each subunit do. Examples of generation of fluorescent
signal using these two subunit reporter systems are the enzyme complementation assay
developed for β-lactamase (PNAS 2002, 99:3469; Nat. Biotech. 2002, 20:619) or
fluorescence resonance energy transfer (FRET) between CFP and YFP. Of these two
systems, β-lactamase is the preferred one because it reconstitutes an enzymatic activity
that can be more sensitive than the generation of fluorescence by FRET (see FIG. 25).

[0212] This method will provide a profile of fusion protein domains that interact

with a target protein of interest (the "bait"). For practical purposes the bait fusion protein 
should be expressed in excess relative to the expression levels of the fusion protein 
domains generated by gene trapping. The intensity of the fluorescent signal (or other 
signal) will result from a combination of the fusion protein levels and the association 
constant between the interacting protein domains fused to each subunit of the reporter 
system. This method allows the generation of a library of protein domains able to interact 
with the target protein of interest as well as protein interaction values (determined by the 
mean fluorescence intensity) for each of them. In this way, a comparison between two 
different cell lines or cell lines cultured in different conditions would reflect the 
differences in protein interactions with a defined protein target. 

[0213] By way of non-limiting example, a target protein is selected whose
function is necessary for tumor growth. The function of this protein may be an important
target for development of small molecule drugs to inhibit tumor growth. However, the
function of such a protein often involves and depends on interactions with other proteins,
and therefore an alternative is to target these protein-protein interactions with small
molecule drugs. This requires a prior characterization of all the protein domains
interacting with the target protein of interest. The characterization of proteins interacting
with a particular target has previously been attempted by using yeast two hybrid analysis.
However, as this technology reports protein interaction through the transcriptional
activation of a reporter gene, the only interactions that can be mapped are those whose
fusion products are transported to the nucleus. In contrast, the β-lactamase
complementation assay described herein does not require a particular cellular localization
of the two fusion products.

[0214] In order to perform a characterization of the protein domains that interact 

with a target protein, the following steps should be performed (FIG. 25). 

[0215] First, a stable cell line expressing a fusion between the target protein (bait
protein) and one of the subunits of the reporter system is constructed (FIG. 25A).
Second, gene trapping is performed on this stable cell line, using retroviral
vectors that encode the second subunit of the reporter system in an exon acceptor
configuration as described herein. This will generate a library of gene trapped protein
domains fused to the second subunit of the reporter system. Those cells that display a
functional protein domain capable of interaction with the bait protein will reconstitute the
reporter function and generate a fluorescent signal (FIG. 25B). These cells can be
sorted according to their fluorescence levels and analyzed for the identity of the exon
fused to the gene trapping subunit (by 5'SAVI, SAVI or 5'RACE) either as a population
or after isolation of individual clones.

EXAMPLE 1 

Results and Descriptions of Vectors 
[0216] A polynucleotide construct (Gene Trap (GT) vector) was constructed with
a splicing acceptor (SA) signal of human γ-globin intron #2 in front of humanized Renilla
green fluorescent protein (hrGFP) to ensure that the hrGFP can be spliced into the exons
of trapped genes (FIGS. 4-6). This SA-hrGFP then was inserted into a retroviral vector in
an anti-sense orientation to avoid interference from the transcription function of the 5' LTR;
furthermore, the 3' LTR of this retroviral vector has been altered with a deletion of the U3
region. The duplication of this deletion in the 3' LTR into the 5' LTR during reverse transcription
disables the 5' LTR promoter function. Therefore, this vector becomes a self-inactivating
(SIN) vector. For titer analysis, and to ensure the existence of the GT vector in retrovirally
transduced (infected) cells, a G418 selection marker gene (NeoR) driven by the human
cytomegalovirus immediate-early (CMV IE) promoter was inserted in the vector after
hrGFP, followed by a bovine growth hormone polyadenylation signal (BGH pA). These
genes and functional signals were constructed in reverse orientation to the LTRs. Expression
of hrGFP can only occur after this vector is integrated downstream of
a cellular promoter.

[0217] Similarly, several different exon trapping vectors have been constructed,
the structural characteristics of which are outlined in Table 1.

[0218] These vectors combine several structural features such as being based on

either MoMLV or HIV-1 backbones, the use of different reporter marker genes (hrGFP,
AcGFP, Neo, HA and V5 epitopes), the presence of either one or both splice acceptor and
splice donor consensus sequences, the presence or absence of selectable markers, exon
trapping markers in different translational reading frames and different restriction sites
flanking the marker in order to allow for identification of the flanking sequence tags by
using different methodologies described before, such as 5'RACE, 3'RACE, 5'SAVI,
3'SAVI, SAVI or inverse PCR. More specifically, as summarized in Table 1, these vectors
are based on either the MoMLV or HIV-1 viral backbones; they contain different markers
for exon trapping such as hrGFP, AcGFP, hrGFP-ires galactosyltransferase, Neo, or the
epitopes HA or V5. Some of these vectors contain a selectable marker to select for
transduced cells such as α-galactosyltransferase, Neomycin or Hygromycin resistance
genes, and the expression of these markers is driven by either the viral LTR promoter or
an internal promoter such as Adenovirus E1b, CMV or PGK. All vectors contain the
reporter exon preceded by a splice acceptor consensus sequence at the 5' end of the
reporter sequence, and some of them are also flanked by a splice donor consensus
sequence at the 3' end of the reporter exon. In some vectors the ATG translation initiation
codon has been mutated to avoid reporting integrations that occur in the 5' UTR of
cellular genes. The marker gene is translated in either frames 0, 1 or 2 as indicated. aSIN
LTR makes reference to a deletion of critical enhancer/promoter sequences in the 3' U3
region of the 3' LTR. pA indicates when a polyadenylation sequence is present in the
vector downstream of the reporter gene. The restriction sites for type IIS restriction
enzymes that flank either one end or both ends of the insertion exon are also indicated in
Table 1. It is also indicated whether these vectors are amenable to analysis of flanking
sequence tags by the SAVI procedure or by rescue of genomic insertion points by
restriction digestion of genomic DNA, re-ligation and bacterial transformation.

[0219] We have shown successful gene trapping with our GT vectors in murine
fibroblasts, namely NIH3T3 (FIGS. 11 and 12) and PA317 cells (FIG. 8A-B), and in human
lung cancer cells (FIG. 10A-B). Fluorescence-activated cell sorting (FACS) was
employed to separate the gene-trapped cell population, which shows green fluorescence
after excitation with 488 nm light. Enrichment of these gene trapping events with a
cell sorter (Altra Cell Sorter, Beckman Coulter Co., Miami, Fla., USA) showed that 95%
of the cell population (FIG. 8A-B) was fluorescence positive and therefore gene trapped,
since hrGFP expression can occur only after the hrGFP gene has integrated downstream
of a cellular promoter. FIG. 14A-D shows examples of normal (HMEC) and cancer (MCF7)
breast cell lines transduced with vectors pGTFS0 or pHSG, which were sorted by FACS into
different gates according to their levels of expression of the fluorescent protein marker
fused to different cellular proteins by means of exon trapping events. Furthermore, in
theory, these hrGFP molecules should be fusion proteins joined in frame to a cellular
protein after splicing has joined cellular exons and hrGFP together. This hypothesis of
splicing and fusion protein formation has been demonstrated with the construct pGT5Z
(FIG. 6), in which a Zeocin-resistance protein is fused to hrGFP after splicing and
translation (FIG. 7). RNA transcripts of hrGFP in the gene-trapped population were also
detected by RT-PCR (FIG. 9). These results demonstrate that gene trapping events can be
monitored at the translational level by FACS and at the transcriptional level by RT-PCR
analysis.
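
The sorting of cells into gates by fluorescence level can be sketched in code. The following is a minimal illustration only, not the software of the cell sorter: the gate labels C, D, E and F follow the subgroups named in this description, while the threshold values, the function names `assign_gate` and `sort_into_gates`, and the rule that cells dimmer than the lowest threshold are left unsorted are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical lower bounds (arbitrary fluorescence units) for gates C, D, E, F.
# Real gate boundaries are set by the operator on the sorter; these are placeholders.
GATE_LOWER_BOUNDS = [10.0, 100.0, 1000.0, 10000.0]
GATE_NAMES = ["C", "D", "E", "F"]


def assign_gate(intensity):
    """Return the gate label for one cell, or None if it is below every gate."""
    index = bisect_right(GATE_LOWER_BOUNDS, intensity)
    if index == 0:
        return None  # dimmer than the lowest gate: treated here as not gene-trapped
    return GATE_NAMES[index - 1]


def sort_into_gates(intensities):
    """Group fluorescence intensities by gate, mimicking a sort into sub-populations."""
    gates = {name: [] for name in GATE_NAMES}
    for value in intensities:
        gate = assign_gate(value)
        if gate is not None:
            gates[gate].append(value)
    return gates
```

Under these assumed thresholds, a cell with intensity 500 would fall into gate D, and brighter gates would correspond to higher levels of the hrGFP fusion protein.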

[0220] One important aspect of the invention is the high-throughput platform for
sorting gene-trapped cells by FACS, which can sort 15,000 cells/second. The quantity
and stability of the trapped gene product, which is fused to hrGFP and becomes an hrGFP
fusion protein, can be determined from the intensity of hrGFP in the cells in FACS
analysis (FIGS. 11 and 12). The invention can therefore be applied to determine
cellular protein levels in a high-throughput manner and to monitor most of the pathways
of gene expression altered by the causes of disease, such as cancer, viral infection, drug
treatment and gene transfer in gene therapy or gene transfer research. Other reporter
genes can be used in place of the hrGFP gene. In another experiment, the rodent
α-1,3-galactosyltransferase gene, which is not expressed in human cells, was used to
demonstrate that gene trapping can be achieved by simple plasmid transfection in up to
1% of the population (FIG. 13C) as well as by retroviral vector infection (FIG. 13B). The
results of this experiment are shown in FIGS. 21 and 22.

[0221] FIG. 21 depicts successful gene trapping in pGT5A-transfected PA317 cells.
An NcoI restriction site located at the 5' end of the hrGFP marker gene and an EcoRI
site at the oligo-dA primer were used as cloning sites to insert the gene-trapped sequence
into a sequencing vector that had been digested with NcoI and EcoRI. In a BLAST search
against the mouse EST database in GenBank, the sequence trapped by pGT5A was a 99%
match to a high mobility group protein, HMGI-C, a nuclear phosphoprotein that contains
three short DNA-binding domains (AT-hooks) and a highly acidic C-terminus.
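
A percent match of the kind quoted above can be illustrated with a small calculation. The sketch below is a simplification and is not BLAST itself: it assumes the two sequences have already been aligned to equal length, with "-" marking gaps, and it counts gap columns as mismatches. The function name `percent_identity` and the example sequences in the usage note are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns holding identical residues.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Columns containing a gap are counted as mismatches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)
```

For example, `percent_identity("ACGTACGTAC", "ACGTACGTAA")` compares ten aligned positions, nine of which agree.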

[0222] Interest in this protein has recently been stimulated by three observations:
expression of the gene is cell-cycle regulated; the gene is rearranged in a number of
tumors of mesenchymal origin; and mice in which both HMGI-C alleles are disrupted exhibit
the pygmy phenotype. These observations suggest a role for HMGI-C in cell growth,
more specifically during fetal growth, since the protein is normally expressed only in
embryonic tissues. It is likely that the HMGI-C protein acts as an architectural
transcription factor, regulating the expression of one or more genes that control
embryonic cell growth. Since HMGI-C binds to the minor groove of AT-rich DNA, this
interaction could be a target for minor groove chemotherapeutic agents in the treatment
of sarcomas expressing the rearranged gene. As can be seen, the invention successfully
identified a potential oncogene and demonstrated a high translation level of its gene
product, as indicated by the high intensity of the hrGFP fusion protein in FACS analysis
(FIG. 8).

[0223] FIG. 22 depicts gene trapping of an exon of unknown biological function in
pGT5A-transfected PA317 cells. An NcoI restriction site located at the 5' end of the
hrGFP marker gene and an EcoRI site at the oligo-dA primer were used as cloning sites to
insert the gene-trapped sequence into a sequencing vector that had been digested with
NcoI and EcoRI. In a BLAST search against the EST database in GenBank, the sequence
trapped by pGT5A was a 95% match to NCI_CGAP_Li9 Mus musculus cDNA clones
BF539247.1/BF533319.1, which have been found in cDNA libraries from salivary gland and
liver. As can be seen, the invention successfully identified a gene without known
biological function but with a known high level of protein production, as indicated by
the fusion protein of this gene product and hrGFP in FACS analysis in FIG. 8. These
results indicate that the invention can correlate the translation level of genes with
otherwise unknown or undefined DNA sequences, for potential discovery of new genes or
targets responsible for diseases or cancers.

[0224] Additional examples of sequence tags obtained by gene trapping with the
vectors pHSG and pGTIO from cells sorted into subgroups C, D, E and F according to
their fluorescence values (see FIG. 14A-D) are presented in Table 2.

Table 2
(Values in the Gate, CDS Start, CDS Stop, Protein Length, Insertion Point, aa Fused to
N-ter and Fraction of Protein Captured columns are not legible in the source.)

Vector  Accession     Definition
HSG     NM_014280.1   DnaJ (Hsp40) homolog, subfamily C, member 8 (DNAJC8), mRNA
HSG     (illegible)   KIAA1068 protein (KIAA1068), mRNA
HSG     NM_006908.2   ras-related C3 botulinum toxin substrate 1 (rho family, small GTP binding protein Rac1) (RAC1), transcript variant Rac1, mRNA
HSG     NM_017730.1   hypothetical protein FLJ20259 (FLJ20259), mRNA
HSG     NM_002719.1   protein phosphatase 2, regulatory subunit B (B56), gamma isoform (PPP2R5C), mRNA
HSG     NM_014963.1   KIAA0963 protein (KIAA0963), mRNA
HSG     Hs#S4284146   cDNA FLJ23733 fis, clone HEP14786
HSG     NM_014757.2   mastermind-like 1 (Drosophila) (MAML1), mRNA
HSG     NM_006816.1   chromosome 5 open reading frame 8 (C5orf8), mRNA
HSG     NM_001499.1   GLE1 RNA export mediator-like (yeast) (GLE1L), mRNA
HSG     XM_050236.6   leukocyte receptor cluster (LRC) member 4 (LENG4), mRNA
HSG     XM_087781.2   cofactor required for Sp1 transcriptional activation, subunit 8 (34kD) (CRSP8), mRNA
HSG     XM_087782.1   similar to p37 TRAP/SMCC/PC2 subunit (LOC220792), mRNA
HSG     NM_001755.1   core-binding factor, beta subunit (CBFB), transcript variant 2, mRNA
HSG     NM_014287.2   pM5 protein (PM5), mRNA
HSG     NM_001664.1   ras homolog gene family, member A (ARHA), mRNA
HSG     NM_007274.1   cytosolic acyl coenzyme A thioester hydrolase (HBACH), mRNA
HSG     NM_012127.1   Cip1-interacting zinc finger protein (CIZ1), mRNA
HSG     NM_005968.2   heterogeneous nuclear ribonucleoprotein M (HNRPM), transcript variant 1, mRNA
HSG     NM_016329.1   RU1 (RU1), mRNA
HSG     NM_001417.1   eukaryotic translation initiation factor 4B (EIF4B), mRNA
HSG     NM_003563.1   speckle-type POZ protein (SPOP), mRNA
HSG     NM_005016.2   poly(rC) binding protein 2 (PCBP2), transcript variant 1, mRNA
HSG     NM_000135.1   Fanconi anemia, complementation group A (FANCA), mRNA
HSG     NM_007353.1   guanine nucleotide binding protein (G protein) alpha 12 (GNA12), mRNA

Table 2 (continued): Definition entries (other columns not legible in the source)

carnitine palmitoyltransferase 1, liver (CPT1A), nuclear gene
hypothetical protein FLJ20657 (NPL4), mRNA
nuclear distribution gene C homolog (A. nidulans) (NUDC),
guanine nucleotide binding protein (G protein), beta polypeptide 1 (GNB1), mRNA
RNA binding motif, single stranded interacting protein 2 (RBMS2), mRNA
CD9 antigen (p24) (CD9), mRNA
karyopherin alpha 3 (importin alpha 4) (KPNA3), mRNA
C3HC4-type zinc finger protein (LZK1), mRNA
actinin, alpha 1 (ACTN1), mRNA
sorting nexin 2 (SNX2), mRNA
translocating chain-associating membrane protein (TRAM),
hypothetical protein MGC11257 (MGC11257), mRNA
SMT3 suppressor of mif two 3 homolog 2 (yeast) (SMT3H2), mRNA
Ras-GTPase activating protein SH3 domain-binding protein 2 (KIAA0660), mRNA
protein-L-isoaspartate (D-aspartate) O-methyltransferase (PCMT1), mRNA
cleavage stimulation factor, 3' pre-RNA, subunit 3, 77kD (CSTF3), mRNA
protein O-fucosyltransferase 1 (POFUT1), mRNA
hypothetical protein FLJ13910 (FLJ13910), mRNA
similar to dJ462023.1 (novel protein) (LOC90529), mRNA
tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta polypeptide (YWHAZ), mRNA
13 (PSMD13), mRNA
cytochrome b5 outer mitochondrial membrane precursor
hypothetical protein FLJ20420 (FLJ20420), mRNA
splicing factor 3a, subunit 1, 120kD (SF3A1), mRNA
pumilio homolog 1 (Drosophila) (PUM1), mRNA
papillary renal cell carcinoma (translocation-associated) (PRCC), mRNA
N-acetyltransferase, homolog of S. cerevisiae ARD1 (ARD1), mRNA
DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 17 (72kD) (DDX17), transcript variant 1, mRNA
BCL2-like 12 (proline rich) (BCL2L12), transcript variant 1, mRNA
HLA-B associated transcript 8 (BAT8), transcript variant NG36/G9a, mRNA
signal recognition particle 68kD (SRP68), mRNA
NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 10 (42kD) (NDUFA10), mRNA
spinocerebellar ataxia 2 (olivopontocerebellar ataxia 2, autosomal dominant, ataxin 2) (SCA2), mRNA
DKFZP434C212 protein (DKFZP434C212), mRNA
met proto-oncogene (hepatocyte growth factor receptor) (MET), mRNA
KIAA0391 gene product (KIAA0391), mRNA
nuclear receptor co-repressor 1 (NCOR1), mRNA
Wolf-Hirschhorn syndrome candidate 1 (WHSC1), transcript variant 3, mRNA
Wolf-Hirschhorn syndrome candidate 1 (WHSC1), transcript variant 8, mRNA
hypothetical protein FLJ22678 (FLJ22678), mRNA
nucleoporin 214kD (CAIN) (NUP214), mRNA
FK506 binding protein 12-rapamycin associated protein 1 (FRAP1), mRNA
polymerase (RNA) II (DNA directed) polypeptide A (220kD) (POLR2A), mRNA
eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), mRNA
myotubularin related protein 3 (MTMR3), mRNA
KIT ligand (KITLG), mRNA
putative nucleotide binding protein, estradiol-induced (E2IG3), mRNA
similar to CG1577 gene product (LOC122704), mRNA
S164 protein (S164), mRNA
Lsm3 protein (LSM3), mRNA
ariadne homolog 2 (Drosophila) (ARIH2), mRNA
mitogen-activated protein kinase 12 (MAPK12), mRNA
LOC124801 (LOC124801), mRNA
hypothetical protein MGC2641 (MGC2641), mRNA
cytochrome P450, 51 (lanosterol 14-alpha-demethylase) (CYP51), mRNA
tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta polypeptide (YWHAZ), mRNA
mitogen-activated protein kinase 8 interacting protein 3 (MAPK8IP3), mRNA
cytochrome b5 outer mitochondrial membrane precursor (CYB5-M), mRNA
START domain containing 7 (STARD7), mRNA
DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 6 (RNA helicase, 54kD) (DDX6), mRNA
ARP3 actin-related protein 3 homolog (yeast) (ACTR3), mRNA
similar to HN1 like (LOC90861), mRNA
spectrin, alpha, non-erythrocytic 1 (alpha-fodrin) (SPTAN1), mRNA
TNF receptor-associated factor 2 (TRAF2), mRNA
LOC204827 (LOC204827), mRNA
hypoxanthine phosphoribosyltransferase 1 (Lesch-Nyhan syndrome) (HPRT1), mRNA
interleukin enhancer binding factor 3, 90kD (ILF3), mRNA
proline and glutamic acid rich nuclear protein (PELP1), mRNA
KIAA1041 protein (KIAA1041), mRNA
mitogen-activated protein kinase kinase 2 (MAP2K2), mRNA
solute carrier family 25 (mitochondrial deoxynucleotide carrier), member 19 (SLC25A19), mRNA
F-box and leucine-rich repeat protein 11 (FBXL11), mRNA
protein tyrosine phosphatase, non receptor type 2 (PTPN2), transcript variant 3, mRNA
DKFZP564G2022 protein (DKFZP564G2022), mRNA
Fanconi anemia, complementation group A (FANCA), mRNA
KIAA1041 protein (KIAA1041), mRNA
hook3 protein (HOOK3), mRNA
polymerase (RNA) II (DNA directed) polypeptide E (25kD) (POLR2E), mRNA
cofactor of BRCA1 (COBRA1), mRNA
KIAA1116 protein (KIAA1116), mRNA
Tax1 (human T-cell leukemia virus type I) binding protein 1 (TAX1BP1), mRNA
procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), beta polypeptide (protein disulfide isomerase; thyroid hormone binding protein p55) (P4HB), mRNA
RAN binding protein 3 (RANBP3), transcript variant RANBP3-d, mRNA
nuclear receptor co-repressor 1 (NCOR1), mRNA
HLA-B associated transcript 1 (BAT1), transcript variant 2, mRNA
v-crk sarcoma virus CT10 oncogene homolog (avian)-like (CRKL), mRNA
bromodomain adjacent to zinc finger domain, 1B (BAZ1B), transcript variant 2, mRNA
nuclear factor of activated T-cells 5, tonicity-responsive (NFAT5), mRNA
pM5 protein (PM5), mRNA
ribonucleotide reductase M2 polypeptide (RRM2), mRNA
coxsackie virus and adenovirus receptor (CXADR), mRNA
stress-induced-phosphoprotein 1 (Hsp70/Hsp90-organizing protein) (STIP1), mRNA
cytokine induced protein 29 kDa (CIP29), mRNA
paraspeckle protein 1 (PSP1), mRNA
hypothetical protein MGC15523 (MGC15523), mRNA
death associated transcription factor 1 (DATF1), transcript variant 3, mRNA
protein kinase, lysine deficient 1 (PRKWNK1), mRNA
hypothetical protein FLJ10534 (FLJ10534), mRNA
spectrin, alpha, non-erythrocytic 1 (alpha-fodrin) (SPTAN1), mRNA
OLIGOSACCHARYL TRANSFERASE STT3 SUBUNIT HOMOLOG (B5) (INTEGRAL MEMBRANE PROTEIN 1) (LOC219846), mRNA
scribble (SCRIB), mRNA
Snf2-related CBP activator protein (SRCAP), mRNA
clathrin, heavy polypeptide (Hc) (CLTC), mRNA
spectrin, alpha, non-erythrocytic 1 (alpha-fodrin) (SPTAN1), mRNA
KIAA0310 gene product (KIAA0310), mRNA
MAP-kinase activating death domain (MADD), transcript variant 6, mRNA
restin (Reed-Steinberg cell-expressed intermediate filament-associated protein) (RSN), mRNA
scribble (SCRIB), mRNA
chromodomain helicase DNA binding protein 1 (CHD1), mRNA
chromodomain helicase DNA binding protein 4 (CHD4), mRNA
chromosome 6 open reading frame 28 (C6orf28), mRNA
Similar to kinesin family member C1, clone MGC:1202 IMAGE:3506669, mRNA, complete cds
villin 2 (ezrin) (VIL2), mRNA
glycoprotein, synaptic 2 (GPSN2), mRNA
clathrin, heavy polypeptide (Hc) (CLTC), mRNA
SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily e, member 1 (SMARCE1), mRNA
cell division cycle 2-like 2 (CDC2L2), transcript variant 7, mRNA
NM_130474.1
guanine nucleotide binding protein (G protein), beta polypeptide 1 (GNB1), mRNA
calponin 3, acidic (CNN3), mRNA
Sjogren syndrome antigen B (autoantigen La) (SSB), mRNA
suppressor of Ty 5 homolog (S. cerevisiae) (SUPT5H), mRNA
LOC204826 (LOC204826), mRNA
RAN binding protein 7 (RANBP7), mRNA
tubby like protein 3 (TULP3), mRNA
hypothetical protein DKFZp547A023 (DKFZp547A023), mRNA
hypothetical protein MGC15875 (MGC15875), mRNA
solute carrier family 23 (nucleobase transporters), member 1 (SLC23A1), mRNA
ATP-binding cassette, sub-family B (MDR/TAP), member 1 (ABCB1), mRNA
hypothetical protein FLJ20651 (FLJ20651), mRNA
KIAA0013 gene product (ARHGAP11A), mRNA
general transcription factor IIF, polypeptide 1 (74kD subunit) (GTF2F1), mRNA
integrin, beta 5 (ITGB5), mRNA
KIAA0793 gene product (KIAA0793), mRNA
membrane component, chromosome 11, surface marker 1 (M11S1), mRNA
centaurin, alpha 1 (CENTA1), mRNA
transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47) (TCF3), mRNA
high-mobility group (nonhistone chromosomal) protein 14 (HMG14), mRNA
RNA helicase-related protein (RNAHP), mRNA
similar to mouse Glt3 or D. malanogaster transcription factor IIB (AF093680), mRNA
core-binding factor, beta subunit (CBFB), transcript variant 2, mRNA
eukaryotic translation initiation factor 2, subunit 2 (beta, 38kD) (EIF2S2), mRNA
HIV-1 Tat interactive protein 2, 30 kD (HTATIP2), mRNA
suppression of tumorigenicity 13 (colon carcinoma) (Hsp70 interacting protein) (ST13), mRNA
chromobox homolog 3 (HP1 gamma homolog, Drosophila) (CBX3), transcript variant 1, mRNA
similar to HN1 like (LOC90861), mRNA
LIM and SH3 protein 1 (LASP1), mRNA
BCL2-associated X protein (BAX), transcript variant sigma, mRNA
likely ortholog of mouse acinus (ACN), mRNA
SUMO-1 activating enzyme subunit 1 (SAE1), mRNA
KIAA1936 protein (KIAA1936), mRNA
hypothetical protein FLJ22206 (FLJ22206), mRNA
TAR DNA binding protein (TARDBP), mRNA
keratin 19 (KRT19), mRNA
nucleolar protein 1 (120kD) (NOL1), mRNA
ADP-ribosylation factor GTPase activating protein 1 (ARFGAP1), mRNA
HSPC133 protein (HSPC133), mRNA
topoisomerase (DNA) I (TOP1), mRNA
PRKC, apoptosis, WT1, regulator (PAWR), mRNA
KIAA1814 protein (KIAA1814), mRNA
transcriptional adaptor 2 (ADA2 homolog, yeast)-like (TADA2L), transcript variant 2, mRNA
HLA-B associated transcript 3 (BAT3), transcript variant 2, mRNA
CGI-148 protein (LOC51030), mRNA
a disintegrin and metalloproteinase domain 9 (meltrin gamma) (ADAM9), mRNA
ribosomal protein L13a (RPL13A), mRNA
optic atrophy 1 (autosomal dominant) (OPA1), nuclear gene encoding mitochondrial protein, transcript variant 1, mRNA
heterogeneous nuclear ribonucleoprotein M (HNRPM), transcript variant 1, mRNA

[0225] This table shows the subgroup of cells from which each particular tag was
obtained (Gate), its GenBank accession number (Accession), the positions of the coding
sequence start (CDS Start) and stop (CDS Stop) nucleotides within the mRNA, the
number of amino acids of the protein encoded by that particular mRNA (Protein Length),
the position within that mRNA where the 5' end of the GFP marker exon was inserted
(Insertion Point), the number of amino acids of the tagged cellular protein fused to the
N-terminus of the GFP marker protein (aa Fused to N-ter), the fraction of the protein
fused to GFP (Fraction of Protein Captured) and the name of the tagged cellular protein
(Definition). Other types of information associated with each data record but not
displayed in the abridged version of this table are reported gene annotation, function
and chromosomal location, with the additional ability to link directly to NCBI web pages
regarding such information. Information about quality control measures for each
collected tag is also obtained in the process. Measures of internal quality control
reported to the viewer include the quality of the sequencing run, base by base, and the
sequence quality of the validation sequence that must precede all captured data tags.
External quality control measures available to the viewer include the degree of local
identity between the data tag and the matched NCBI RefSeq cDNA, as well as the length of
the match.
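
The relationship among the Table 2 columns can be illustrated with a short calculation. The sketch below is an assumption about how such columns could be derived and is not the procedure actually used to build the table: it treats all positions as 1-based nucleotide positions within the mRNA, takes CDS Stop as the last nucleotide of the stop codon, and counts only whole cellular codons upstream of the insertion point. The class `TagRecord`, the function names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TagRecord:
    accession: str
    cds_start: int        # 1-based position of the first coding nucleotide (assumed)
    cds_stop: int         # 1-based position of the last stop-codon nucleotide (assumed)
    insertion_point: int  # 1-based mRNA position where the marker exon begins (assumed)


def protein_length(record):
    """Amino acids encoded by the CDS, excluding the stop codon."""
    coding_nt = record.cds_stop - record.cds_start + 1
    return coding_nt // 3 - 1


def aa_fused(record):
    """Whole cellular codons lying upstream of the insertion point."""
    upstream_nt = record.insertion_point - record.cds_start
    aa = max(upstream_nt, 0) // 3  # an insertion in the 5' UTR captures no codons
    return min(aa, protein_length(record))  # cannot exceed the full protein


def fraction_captured(record):
    """Share of the cellular protein fused ahead of the marker, from 0.0 to 1.0."""
    return aa_fused(record) / protein_length(record)


# Hypothetical record, not taken from Table 2.
example = TagRecord("EXAMPLE_0001", cds_start=101, cds_stop=1003, insertion_point=551)
```

With these illustrative numbers the coding sequence spans 903 nucleotides (300 amino acids plus a stop codon), 150 codons lie upstream of the insertion point, and half of the protein is captured.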

[0226] FIGS. 22 and 23 represent compilations of genes captured, sequenced and
annotated using the described method with several gene-trapping vectors. FIG. 22
displays the allocation of all captured proteins among different subcellular localizations
as a percentage of the total number of captured proteins. The graph in FIG. 22 represents
a total of 1271 data records and at least 109 subcellular localization descriptions, which
have been classified into one of 8 subheadings: unknown, cytoplasm, nucleus, membrane,
cytoskeletal, organelle, extracellular and peripheral membrane. The vectors displayed
include HSG, pGTIO, and pGTFS0, -1 and -2 (combined and denoted as FS012). The
general trend of the subcellular distribution of the endogenous proteins identified by the
chimeric fusion products appears consistent among different gene-trapping vectors in two
different target cell lines, MCF-7 and HCT-15. FIG. 23 displays the putative functions of
all collected gene-trapped products among the listed vectors and cell lines. The graph in
FIG. 23 represents 1271 data records and at least 233 categories placed in one of 16
major subheadings for protein function. The data are displayed as the percentage of
captured proteins contained in each subheading for protein function. A consistent trend
in the distribution of captured proteins over the various functional subheadings can be
observed among the different gene-trapping vectors and target cell lines. The
compilation of data records obtained by the described method can therefore potentially
describe the global picture of the active proteome. Expansion of this dataset to include
all detectable chimeric fusion products can be used to make comparisons regarding
function and cell localization among individual proteins of the proteome within different
tissues, cell types and phenotypes. The method of the present invention can also be used
to compare the translational regulation of gene products between cancer and normal cells
by comparing the protein dataset with conventional RNA microarray data.
Additionally, the method may be used to characterize global changes in cell/tissue
phenotypes by analyzing shifts in the distribution pattern of the proteome among
different subcellular localizations or protein functions.
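
The percentage distributions and distribution shifts described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis used to produce the figures: each data record is reduced to a single category label, and a shift is measured as a simple difference in percentage points between two profiles. The function names `category_percentages` and `largest_shift` and the labels in the usage note are hypothetical.

```python
from collections import Counter


def category_percentages(labels):
    """Map each category to its share of all records, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


def largest_shift(profile_a, profile_b):
    """Return (category, change in percentage points) with the largest absolute change.

    A category missing from one profile is treated as 0 percent there.
    """
    categories = set(profile_a) | set(profile_b)
    return max(
        ((c, profile_b.get(c, 0.0) - profile_a.get(c, 0.0)) for c in categories),
        key=lambda item: abs(item[1]),
    )
```

For example, a profile built from the labels nucleus, nucleus, cytoplasm and membrane could be compared with one built from nucleus, cytoplasm, cytoplasm and cytoplasm to find the subheading whose share changed most between two cell populations.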

[0227] The foregoing description is presented by way of illustration and is not
intended to limit the scope of the present invention as set out in the appended claims.

[0228] All of the references cited herein are incorporated by reference in their
entirety.